Introduction

This study aims to understand the factors behind the support or rejection of transgender rights in Europe. For this purpose, we will use the Special Eurobarometer 493, which focuses on several people’s perceptions, attitudes and opinions of discrimination based on personal characteristics.

To understand this support, we will focus specifically on the variable of support for transgender people to change their official documents to match their inner gender identity (QC19). This variable is analysed through cross-country differences and to develop a predictive model for other countries. To do this, we selected individual-level variables from the Eurobarometer survey alongside country-level indicators from external sources. These variables capture key demographic, social, and political factors influencing attitudes towards transgender rights.

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.1
## Warning: package 'ggplot2' was built under R version 4.4.1
## Warning: package 'tibble' was built under R version 4.4.1
## Warning: package 'tidyr' was built under R version 4.4.1
## Warning: package 'readr' was built under R version 4.4.1
## Warning: package 'purrr' was built under R version 4.4.1
## Warning: package 'dplyr' was built under R version 4.4.1
## Warning: package 'stringr' was built under R version 4.4.1
## Warning: package 'forcats' was built under R version 4.4.1
## Warning: package 'lubridate' was built under R version 4.4.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
library(haven)
## Warning: package 'haven' was built under R version 4.4.1
library(patchwork)
## Warning: package 'patchwork' was built under R version 4.4.1
library(scales)
## Warning: package 'scales' was built under R version 4.4.1
## 
## Adjuntando el paquete: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
library(countries)
## Warning: package 'countries' was built under R version 4.4.3
library(showtext)
## Warning: package 'showtext' was built under R version 4.4.1
## Cargando paquete requerido: sysfonts
## Warning: package 'sysfonts' was built under R version 4.4.1
## Cargando paquete requerido: showtextdb
## Warning: package 'showtextdb' was built under R version 4.4.1
library(mice)
## Warning: package 'mice' was built under R version 4.4.2
## 
## Adjuntando el paquete: 'mice'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library(DataExplorer)
## Warning: package 'DataExplorer' was built under R version 4.4.1
library(ggrepel)  
## Warning: package 'ggrepel' was built under R version 4.4.1
library(plotly)
## Warning: package 'plotly' was built under R version 4.4.2
## 
## Adjuntando el paquete: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(rworldmap)
## Warning: package 'rworldmap' was built under R version 4.4.2
## Cargando paquete requerido: sp
## Warning: package 'sp' was built under R version 4.4.2
## ### Welcome to rworldmap ###
## For a short introduction type :   vignette('rworldmap')
library(janitor)
## Warning: package 'janitor' was built under R version 4.4.1
## 
## Adjuntando el paquete: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(car)
## Warning: package 'car' was built under R version 4.4.1
## Cargando paquete requerido: carData
## Warning: package 'carData' was built under R version 4.4.1
## 
## Adjuntando el paquete: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library(lme4)
## Warning: package 'lme4' was built under R version 4.4.1
## Cargando paquete requerido: Matrix
## 
## Adjuntando el paquete: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
library(ggplot2)
library(RColorBrewer)
library(caret)
## Warning: package 'caret' was built under R version 4.4.1
## Cargando paquete requerido: lattice
## Warning: package 'lattice' was built under R version 4.4.1
## 
## Adjuntando el paquete: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.4.2
library(stargazer)
## 
## Please cite as: 
## 
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

Data cleaning

Country-level variables

To analyse the effect of different country-level characteristics we have obtained data from the World Bank, ILGA Europe (Rainbow Map) and from the Trans Rights Indicator Project by Harvard University.

# Religiosity-----------------------------------------------
religiosity <- read_dta("data/replication_data.dta")

religiosity <- religiosity |> 
  group_by(country_name) |> 
  slice_max(year) |> 
  ungroup()

religiosity <- religiosity |> 
  select(religiosity_percent, country_name)

religiosity <- religiosity |> 
  mutate(country = country_name(country_name, to = "simple", 
                                verbose = TRUE),
         country_iso = countrycode::countrycode(country, origin = "country.name",
                                                destination = "iso2c")) |> 
  select(country_iso, religiosity_percent)
## 
## In total 173 unique country names were provided
## 172/173 have been matched with EXACT matching
## 1/173 have been matched with FUZZY matching
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `country_iso = countrycode::countrycode(country, origin =
##   "country.name", destination = "iso2c")`.
## Caused by warning:
## ! Some values were not matched unambiguously: Kosovo
# Economic indicators---------------------------------------
economic_indicators <- read_xlsx("data/country_data.xlsx")

economic_indicators <- economic_indicators |> 
  drop_na(`Series Code`) 

economic_indicators <- economic_indicators %>%
  mutate(across(everything(), ~ str_remove_all(.x, "^\\.\\.$"))) %>%
  mutate(across(everything(), ~ if_else(.x == "", NA, .x)))

economic_indicators <- economic_indicators %>%
  mutate(latest_value = coalesce(`2023 [YR2023]`, 
                                 `2022 [YR2022]`, 
                                 `2021 [YR2021]`, 
                                 `2020 [YR2020]`)) |> 
  mutate(latest_value = as.numeric(latest_value))

economic_indicators <- economic_indicators |> 
  select(`Series Name`, `Country Name`, latest_value)

economic_indicators <- economic_indicators |> 
  pivot_wider(names_from = `Series Name`, values_from = latest_value )

economic_indicators <- economic_indicators |> 
  rename("country" = "Country Name", 
         "gini" = "Gini index",
         "gdp_pc" = "GDP per capita (constant 2015 US$)")

economic_indicators <- economic_indicators |> 
  mutate(country = country_name(country, to = "simple", 
                                verbose = TRUE),
         country_iso = countrycode::countrycode(country, 
                                                origin = "country.name",
                                                destination = "iso2c")) |> 
  select(country_iso, gini, gdp_pc)
## 
## In total 58 unique country names were provided
## 57/58 have been matched with EXACT matching
## 1/58 have been matched with FUZZY matching
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `country_iso = countrycode::countrycode(country, origin =
##   "country.name", destination = "iso2c")`.
## Caused by warning:
## ! Some values were not matched unambiguously: Channel Islands, Kosovo
economic_indicators <- economic_indicators |>
  add_row(country_iso = "MT", gini = 31.4, gdp_pc = 33000.6)
# Trans laws------------------------------------------------

trans <-  read_csv("data/trans.csv")
## New names:
## Rows: 51 Columns: 17
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (17): ...1, ...2, Legal gender recognition...3, Legal gender recognition...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
## • `` -> `...2`
## • `Legal gender recognition` -> `Legal gender recognition...3`
## • `Legal gender recognition` -> `Legal gender recognition...4`
## • `Legal gender recognition` -> `Legal gender recognition...5`
## • `Legal gender recognition` -> `Legal gender recognition...6`
## • `Legal gender recognition` -> `Legal gender recognition...7`
## • `Legal gender recognition` -> `Legal gender recognition...8`
## • `Legal gender recognition` -> `Legal gender recognition...9`
## • `Legal gender recognition` -> `Legal gender recognition...10`
## • `Legal gender recognition` -> `Legal gender recognition...11`
## • `Legal gender recognition` -> `Legal gender recognition...12`
## • `Legal gender recognition` -> `Legal gender recognition...13`
## • `Legal gender recognition` -> `Legal gender recognition...14`
## • `Legal gender recognition` -> `Legal gender recognition...15`
## • `Legal gender recognition` -> `Legal gender recognition...16`
## • `Legal gender recognition` -> `Legal gender recognition...17`
trans <- trans |> 
  row_to_names(row_number = 1)
## Warning: Row 1 does not provide unique names. Consider running clean_names()
## after row_to_names().
trans <- trans |> 
  clean_names() |> 
  select(2, 8) |> 
  drop_na(na_2) |> 
  rename("country" = "na_2")

trans <- trans |> 
  mutate(country = country_name(country, to = "simple", 
                                verbose = TRUE),
         country_iso = countrycode::countrycode(country, origin = "country.name",
                                                destination = "iso2c")) |> 
  select(country_iso, self_determination)
## 
## In total 49 unique country names were provided
## 49/49 have been matched with EXACT matching
## 0/49 have been matched with FUZZY matching
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `country_iso = countrycode::countrycode(country, origin =
##   "country.name", destination = "iso2c")`.
## Caused by warning:
## ! Some values were not matched unambiguously: Kosovo
# Rainbow score---------------------------------------------
rainbow <- read_csv("data/rainbow.csv")
## New names:
## Rows: 51 Columns: 78
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (77): ...1, ...2, Equality & non-discrimination...4, Equality & non-disc... num
## (1): RANKING
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
## • `` -> `...2`
## • `Equality & non-discrimination` -> `Equality & non-discrimination...4`
## • `Equality & non-discrimination` -> `Equality & non-discrimination...5`
## • `Equality & non-discrimination` -> `Equality & non-discrimination...6`
## • `Equality & non-discrimination` -> `Equality & non-discrimination...7`
## • `Equality & non-discrimination` -> `Equality & non-discrimination...8`
## • `Equality & non-discrimination` -> `Equality & non-discrimination...9`
## • `Equality & non-discrimination` -> `Equality & non-discrimination...10`
## • `Equality & non-discrimination` -> `Equality & non-discrimination...11`
## • `Equality & non-discrimination` -> `Equality & non-discrimination...12`
## • `Equality & non-discrimination` -> `Equality & non-discrimination...13`
## • `Equality & non-discrimination` -> `Equality & non-discrimination...14`
## • `Equality & non-discrimination` -> `Equality & non-discrimination...15`
## • `Equality & non-discrimination` -> `Equality & non-discrimination...16`
## • `Equality & non-discrimination` -> `Equality & non-discrimination...17`
## • `Equality & non-discrimination` -> `Equality & non-discrimination...18`
## • `Equality & non-discrimination` -> `Equality & non-discrimination...19`
## • `Equality & non-discrimination` -> `Equality & non-discrimination...20`
## • `Equality & non-discrimination` -> `Equality & non-discrimination...21`
## • `Equality & non-discrimination` -> `Equality & non-discrimination...22`
## • `Equality & non-discrimination` -> `Equality & non-discrimination...23`
## • `Equality & non-discrimination` -> `Equality & non-discrimination...24`
## • `Equality & non-discrimination` -> `Equality & non-discrimination...25`
## • `Equality & non-discrimination` -> `Equality & non-discrimination...26`
## • `Equality & non-discrimination` -> `Equality & non-discrimination...27`
## • `Equality & non-discrimination` -> `Equality & non-discrimination...28`
## • `Family` -> `Family...29`
## • `Family` -> `Family...30`
## • `Family` -> `Family...31`
## • `Family` -> `Family...32`
## • `Family` -> `Family...33`
## • `Family` -> `Family...34`
## • `Family` -> `Family...35`
## • `Family` -> `Family...36`
## • `Family` -> `Family...37`
## • `Family` -> `Family...38`
## • `Family` -> `Family...39`
## • `Hate crime & hate speech` -> `Hate crime & hate speech...40`
## • `Hate crime & hate speech` -> `Hate crime & hate speech...41`
## • `Hate crime & hate speech` -> `Hate crime & hate speech...42`
## • `Hate crime & hate speech` -> `Hate crime & hate speech...43`
## • `Hate crime & hate speech` -> `Hate crime & hate speech...44`
## • `Hate crime & hate speech` -> `Hate crime & hate speech...45`
## • `Hate crime & hate speech` -> `Hate crime & hate speech...46`
## • `Hate crime & hate speech` -> `Hate crime & hate speech...47`
## • `Legal gender recognition` -> `Legal gender recognition...48`
## • `Legal gender recognition` -> `Legal gender recognition...49`
## • `Legal gender recognition` -> `Legal gender recognition...50`
## • `Legal gender recognition` -> `Legal gender recognition...51`
## • `Legal gender recognition` -> `Legal gender recognition...52`
## • `Legal gender recognition` -> `Legal gender recognition...53`
## • `Legal gender recognition` -> `Legal gender recognition...54`
## • `Legal gender recognition` -> `Legal gender recognition...55`
## • `Legal gender recognition` -> `Legal gender recognition...56`
## • `Legal gender recognition` -> `Legal gender recognition...57`
## • `Legal gender recognition` -> `Legal gender recognition...58`
## • `Legal gender recognition` -> `Legal gender recognition...59`
## • `Legal gender recognition` -> `Legal gender recognition...60`
## • `Legal gender recognition` -> `Legal gender recognition...61`
## • `Legal gender recognition` -> `Legal gender recognition...62`
## • `Intersex bodily integrity` -> `Intersex bodily integrity...63`
## • `Intersex bodily integrity` -> `Intersex bodily integrity...64`
## • `Intersex bodily integrity` -> `Intersex bodily integrity...65`
## • `Intersex bodily integrity` -> `Intersex bodily integrity...66`
## • `Civil society space` -> `Civil society space...67`
## • `Civil society space` -> `Civil society space...68`
## • `Civil society space` -> `Civil society space...69`
## • `Civil society space` -> `Civil society space...70`
## • `Civil society space` -> `Civil society space...71`
## • `Civil society space` -> `Civil society space...72`
## • `Asylum` -> `Asylum...73`
## • `Asylum` -> `Asylum...74`
## • `Asylum` -> `Asylum...75`
## • `Asylum` -> `Asylum...76`
## • `Asylum` -> `Asylum...77`
## • `Asylum` -> `Asylum...78`
rainbow <- rainbow |> 
  row_to_names(row_number = 1) |> 
  clean_names() |> 
  select(2, 3) |> 
  drop_na(na_2) |> 
  rename("country" = "na_2",
         "rain_ind" = "na_3") |> 
  mutate(rain_ind = as.numeric(rain_ind) / 100)
## Warning: Row 1 does not provide unique names. Consider running clean_names()
## after row_to_names().
rainbow <- rainbow |> 
  mutate(country = country_name(country, to = "simple", 
                                verbose = TRUE),
         country_iso = countrycode::countrycode(country, origin = "country.name",
                                                destination = "iso2c")) |> 
  select(country_iso, rain_ind)
## 
## In total 49 unique country names were provided
## 49/49 have been matched with EXACT matching
## 0/49 have been matched with FUZZY matching
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `country_iso = countrycode::countrycode(country, origin =
##   "country.name", destination = "iso2c")`.
## Caused by warning:
## ! Some values were not matched unambiguously: Kosovo

Individual level variables from the survey

In this section we will select the individual level variables included in the survey that could be relevant for the analysis.

To avoid possible errors in the subsequent models and to facilitate the interpretation and analysis of the variables, we carried out an initial recoding of the variables we selected.

survey <- read_dta("data/ZA7575.dta")

survey <- survey |> 
  select(caseid, serialid, isocntry, d10, d11, 
         d1, sd3, d7, sd1_4, sd1_7, sd1_8, d70,
         d63, qc19, d15a, d60, qc13_11, qc6_10, qc17_4)

# Recode dependent variable 
filtered_survey <- survey |> 
  mutate(trans_doc = case_when(
    qc19==1 ~ 1,
    qc19==2 ~ 0,
    qc19==3 ~ NA)) 

# Recode gender
filtered_survey <- filtered_survey |> 
  mutate(female = case_when(
    d10 == 1 ~ 0,
    d10 == 2 ~ 1,
    TRUE ~ NA),
    female = factor(female))

# Recode age
filtered_survey <- filtered_survey |> 
  mutate(age = if_else(d11 == 99 , NA, d11))

# Current occupation
filtered_survey <- filtered_survey |> 
  mutate(occupation = case_when(
    d15a %in% c(1, 3) ~ "Unemployed",
    d15a == 2 ~ "Student",
    d15a == 4 ~ "Retired",
    d15a <= 9 ~ "Self employed",
    d15a <=  18 ~ "Employed" ,
    TRUE ~ NA), 
    occupation = factor(occupation))

# Recode religion 
filtered_survey <- filtered_survey |> 
  mutate(religion = case_when(
    sd3 %in% c(1, 3, 4) ~ "Christians", 
    sd3 == 2 ~ "Orthodox Chrsitian", 
    sd3 %in% c(6, 7, 8) ~ "Muslims", 
    sd3 %in% c(5, 9, 10, 11, 14) ~ "Other religions", 
    sd3 %in% c(12, 13) ~ "Not religious", 
    TRUE ~ NA),
    religion = factor(religion))

# Marital status
filtered_survey <- filtered_survey |> 
  mutate(marital_status = case_when(
    d7 <= 4 ~ "Married",
    d7 <= 10 ~ "Single", 
    d7 <= 12 ~ "Divorced",
    d7 <=  14 ~ "Widowed",
    d7 == 15 ~ "Other", 
    TRUE ~ NA), 
    marital_status = factor(marital_status))

# Personal satisfaction
filtered_survey <- filtered_survey |> 
  mutate(personal_satis = if_else(d70 == 5, NA, d70),
         personal_satis = case_when(
           personal_satis <= 2 ~ "Satisfied",
           personal_satis > 2 ~ "Not satisfied"))

# Ideology
filtered_survey <- filtered_survey |> 
  mutate(ideology = if_else(d1 > 10, NA, d1)) 

# Contact LGBTQ+
filtered_survey <- filtered_survey |>  
  mutate(contact_lgbti = case_when(
    sd1_4 == 1 | sd1_7 == 1 | sd1_8 == 1 ~ "Contact", # There is contact
    sd1_4 == 2 | sd1_7 == 2 | sd1_8 == 2 ~ "No contact", # There is not contact
    TRUE ~ NA),
  contact_lgbti = factor(contact_lgbti))

# Self-perception of the social class
filtered_survey <- filtered_survey |> 
  mutate(social_class = case_when(
    d63 %in% c(1, 2) ~ "Working class", 
    d63 == 3 ~ "Middle class", 
    d63 %in% c(4, 5) ~ "High class", 
    TRUE ~ NA))

# Country names: DE-E and DE-W were problems 
filtered_survey <- filtered_survey |>  
  mutate(country = country_name(isocntry, to = "simple", 
                                verbose = TRUE, poor_matches = TRUE),
         isocntry = countrycode::countrycode(
           country, origin = "country.name", destination = "iso2c"))
## 
## In total 29 unique country names were provided
## 27/29 have been matched with EXACT matching
## 2/29 have been matched with FUZZY matching, out of which:
## 2/2 are a POOR match (likely wrongly identified)
## 
## 
## Multiple arguments have been matched to the same country name:
##   - DE-E : Germany 
##   - DE-W : Germany
## 
## The matching for the following countries is likely inaccurate:
## (Set - poor_matches - to FALSE to return NAs and - na_fill - to replace the NAs with the original name in - x)
##  - DE-E : Germany 
##  - DE-W : Germany
individual_data <- filtered_survey |> 
  select(caseid, serialid, isocntry, trans_doc, female, age, religion, 
         marital_status, personal_satis, ideology, 
         contact_lgbti, social_class, occupation)

Structural Cleaning and Final Merge

Before merging all data sets and proceeding with the analysis, we conducted an initial assessment of their characteristics to verify data consistency and ensure a reliable analytical process.

str(individual_data)
## tibble [27,438 × 13] (S3: tbl_df/tbl/data.frame)
##  $ caseid        : num [1:27438] 47 53 55 56 62 67 69 70 71 81 ...
##   ..- attr(*, "label")= chr "KANTAR CASE ID"
##   ..- attr(*, "format.stata")= chr "%12.0g"
##  $ serialid      : num [1:27438] 1 2 3 4 5 6 7 8 9 10 ...
##   ..- attr(*, "label")= chr "SERIAL CASE ID (PROVIDED BY KANTAR)"
##   ..- attr(*, "format.stata")= chr "%8.0g"
##  $ isocntry      : chr [1:27438] "BE" "BE" "BE" "BE" ...
##  $ trans_doc     : num [1:27438] 1 1 1 1 1 1 1 0 1 1 ...
##  $ female        : Factor w/ 2 levels "0","1": 1 2 2 1 2 2 1 1 2 2 ...
##  $ age           : dbl+lbl [1:27438] 51, 62, 38, 29, 63, 41, 48, 88, 44, 45, 54, 34, 23, ...
##    ..@ label       : chr "AGE EXACT"
##    ..@ format.stata: chr "%8.0g"
##    ..@ labels      : Named num [1:2] 15 98
##    .. ..- attr(*, "names")= chr [1:2] "15 years" "98 years"
##  $ religion      : Factor w/ 5 levels "Christians","Muslims",..: 1 1 3 3 1 3 3 1 1 1 ...
##  $ marital_status: Factor w/ 5 levels "Divorced","Married",..: 1 2 2 2 2 2 2 1 2 2 ...
##  $ personal_satis: chr [1:27438] "Satisfied" "Satisfied" "Satisfied" "Satisfied" ...
##  $ ideology      : dbl+lbl [1:27438]  7,  8,  4, 10,  8,  7,  3,  9,  3,  8,  9,  1,  1, ...
##    ..@ label       : chr "LEFT-RIGHT PLACEMENT"
##    ..@ format.stata: chr "%8.0g"
##    ..@ labels      : Named num [1:12] 1 2 3 4 5 6 7 8 9 10 ...
##    .. ..- attr(*, "names")= chr [1:12] "Box 1 - left" "Box 2" "Box 3" "Box 4" ...
##  $ contact_lgbti : Factor w/ 2 levels "Contact","No contact": 2 2 1 1 2 2 2 2 1 1 ...
##  $ social_class  : chr [1:27438] "Working class" "Middle class" "Working class" "High class" ...
##  $ occupation    : Factor w/ 5 levels "Employed","Retired",..: 1 5 1 1 5 1 1 2 5 1 ...
str(religiosity)
## tibble [173 × 2] (S3: tbl_df/tbl/data.frame)
##  $ country_iso        : chr [1:173] "AF" "AL" "DZ" "AO" ...
##  $ religiosity_percent: num [1:173] 1 1 0.982 0.948 0.879 ...
##   ..- attr(*, "label")= chr "Religiosity (%)"
##   ..- attr(*, "format.stata")= chr "%10.0g"
str(economic_indicators)
## tibble [59 × 3] (S3: tbl_df/tbl/data.frame)
##  $ country_iso: chr [1:59] "AL" "AD" "AM" "AT" ...
##  $ gini       : num [1:59] 29.4 NA 27.9 30.7 NA 24.4 26.6 NA 39 NA ...
##  $ gdp_pc     : num [1:59] 5420 40227 5151 46339 5651 ...
str(trans)
## tibble [49 × 2] (S3: tbl_df/tbl/data.frame)
##  $ country_iso       : chr [1:49] "AL" "AD" "AM" "AT" ...
##  $ self_determination: chr [1:49] "0" "0" "0" "0" ...
str(rainbow)
## tibble [49 × 2] (S3: tbl_df/tbl/data.frame)
##  $ country_iso: chr [1:49] "AL" "AD" "AM" "AT" ...
##  $ rain_ind   : num [1:49] 36.38 43.23 9.16 49.63 2.25 ...

We observe some inconsistencies in two data sets: individual_data and religiosity. The problem with the variables in these data sets is that they contain attributes, some are labeled doubles (dbl+lbl), some are not proper factors, etc. This will make working with the data, imputing it, making models as well as making predictions much harder. Hence, it is highly important to clean the data set structurally before continuing.

We first proceed with the individual_data data set:

individual_data <- individual_data %>%
  mutate(
    
    # labelled numerics => plain numeric
    across(where(~ inherits(., "haven_labelled")), ~ as.numeric(.)),
    
    # character variables => factors
    isocntry = as.factor(isocntry),
    personal_satis = as.factor(personal_satis),
    social_class = as.factor(social_class),

    # factor variables => proper factors
    trans_doc = factor(trans_doc),
    female = factor(female),
    contact_lgbti = factor(contact_lgbti),
    religion = factor(religion),
    marital_status = factor(marital_status),
    occupation = factor(occupation),
    
    # dbl+lbl => plain numeric 
    ideology = as.numeric(ideology),  
    age = as.numeric(age),
    
    # numeric variables => plain numeric
    caseid = as.numeric(caseid), 
    serialid = as.numeric(serialid),
    ) %>%
  
  # attributes 
  mutate(across(everything(), ~ {
    attr(., "label") <- NULL
    attr(., "format.stata") <- NULL
    attr(., "labels") <- NULL
    return(.)
  }))

str(individual_data)
## tibble [27,438 × 13] (S3: tbl_df/tbl/data.frame)
##  $ caseid        : num [1:27438] 47 53 55 56 62 67 69 70 71 81 ...
##  $ serialid      : num [1:27438] 1 2 3 4 5 6 7 8 9 10 ...
##  $ isocntry      : Factor w/ 28 levels "AT","BE","BG",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ trans_doc     : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 1 2 2 ...
##  $ female        : Factor w/ 2 levels "0","1": 1 2 2 1 2 2 1 1 2 2 ...
##  $ age           : num [1:27438] 51 62 38 29 63 41 48 88 44 45 ...
##  $ religion      : Factor w/ 5 levels "Christians","Muslims",..: 1 1 3 3 1 3 3 1 1 1 ...
##  $ marital_status: Factor w/ 5 levels "Divorced","Married",..: 1 2 2 2 2 2 2 1 2 2 ...
##  $ personal_satis: Factor w/ 2 levels "Not satisfied",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ ideology      : num [1:27438] 7 8 4 10 8 7 3 9 3 8 ...
##  $ contact_lgbti : Factor w/ 2 levels "Contact","No contact": 2 2 1 1 2 2 2 2 1 1 ...
##  $ social_class  : Factor w/ 3 levels "High class","Middle class",..: 3 2 3 1 1 2 2 2 2 1 ...
##  $ occupation    : Factor w/ 5 levels "Employed","Retired",..: 1 5 1 1 5 1 1 2 5 1 ...

And then we repeat the process for the religiosity data set:

religiosity <- religiosity |> 
  mutate(across(everything(), ~{
    attr(.,"label") <- NULL
    attr(., "format.stata") <- NULL
    return(.)
  }))

Now, we can finally merge all the datasets using left_join:

# Final merge
data_descriptive <- individual_data |> 
  left_join(rainbow, by = c("isocntry" = "country_iso")) |> 
  left_join(trans, by = c("isocntry" = "country_iso")) |> 
  left_join(economic_indicators, by = c("isocntry" = "country_iso")) |> 
  left_join(religiosity, by = c("isocntry" = "country_iso"))

Other interesting variables from the survey

In the survey used for this project there are several questions that deal with issues related to discrimination towards different minorities, as well as support for their rights and freedoms. In order to complete the analysis of support for transgender people, we thought it would be appropriate to add more variables dealing with this issue and not only the question on gender reassignment in official documents. These variables will not be included in the models later.

other_variables <- filtered_survey |> 
  select(isocntry, qc13_11, qc6_10, qc17_4)

# How comfortable you would feel if one of your children was in love with a trans person
other_variables <- other_variables |> 
  mutate(child_love = if_else(qc13_11 > 10, NA, qc13_11))

# Transgender person in a high political position
other_variables <- other_variables |>  
  mutate(trans_pol = if_else(qc6_10 > 10, NA, qc6_10))

# Information in schools about being transgender 
other_variables <- other_variables |> 
  mutate(trans_school = case_when(
    qc17_4 == 1 ~ "Totally agree",
    qc17_4 == 2 ~ "Tend to agree",
    qc17_4 == 3 ~ "Tend to disagree",
    qc17_4 == 4 ~ "Totally disagree",
    TRUE ~ NA),
    trans_school = factor(trans_school))

other_variables <- other_variables |> 
  mutate(country = country_name(isocntry, to = "simple", 
                                verbose = TRUE, poor_matches = TRUE),
         isocntry = countrycode::countrycode(
           country, origin = "country.name", destination = "iso2c"))
## 
## In total 28 unique country names were provided
## 28/28 have been matched with EXACT matching
## 0/28 have been matched with FUZZY matching
other_variables <- other_variables |> 
  select(child_love, trans_pol, trans_school, isocntry)

Descriptive analysis

Individual descriptive analysis

plot_intro(data_descriptive)

Before diving into an in-depth analysis of the dataset, we first examine its key characteristics using this plot_intro. We observe that 55.6% of the dataset consists of discrete variables, while 44.4% are continuous. Additionally, 71.3% of the rows are complete, meaning they contain no missing values in any variable. The total percentage of missing values is minimal, accounting for just 2.1%.

str(data_descriptive)
## tibble [27,438 × 18] (S3: tbl_df/tbl/data.frame)
##  $ caseid             : num [1:27438] 47 53 55 56 62 67 69 70 71 81 ...
##  $ serialid           : num [1:27438] 1 2 3 4 5 6 7 8 9 10 ...
##  $ isocntry           : chr [1:27438] "BE" "BE" "BE" "BE" ...
##  $ trans_doc          : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 1 2 2 ...
##  $ female             : Factor w/ 2 levels "0","1": 1 2 2 1 2 2 1 1 2 2 ...
##  $ age                : num [1:27438] 51 62 38 29 63 41 48 88 44 45 ...
##  $ religion           : Factor w/ 5 levels "Christians","Muslims",..: 1 1 3 3 1 3 3 1 1 1 ...
##  $ marital_status     : Factor w/ 5 levels "Divorced","Married",..: 1 2 2 2 2 2 2 1 2 2 ...
##  $ personal_satis     : Factor w/ 2 levels "Not satisfied",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ ideology           : num [1:27438] 7 8 4 10 8 7 3 9 3 8 ...
##  $ contact_lgbti      : Factor w/ 2 levels "Contact","No contact": 2 2 1 1 2 2 2 2 1 1 ...
##  $ social_class       : Factor w/ 3 levels "High class","Middle class",..: 3 2 3 1 1 2 2 2 2 1 ...
##  $ occupation         : Factor w/ 5 levels "Employed","Retired",..: 1 5 1 1 5 1 1 2 5 1 ...
##  $ rain_ind           : num [1:27438] 78.5 78.5 78.5 78.5 78.5 ...
##  $ self_determination : chr [1:27438] "1" "1" "1" "1" ...
##  $ gini               : num [1:27438] 26.6 26.6 26.6 26.6 26.6 26.6 26.6 26.6 26.6 26.6 ...
##  $ gdp_pc             : num [1:27438] 44731 44731 44731 44731 44731 ...
##  $ religiosity_percent: num [1:27438] 0.689 0.689 0.689 0.689 0.689 ...
summary(data_descriptive)
##      caseid           serialid       isocntry         trans_doc    female   
##  Min.   :      1   Min.   :    1   Length:27438       0   : 9695   0:12492  
##  1st Qu.:    731   1st Qu.: 6860   Class :character   1   :14463   1:14946  
##  Median :   1436   Median :13720   Mode  :character   NA's: 3280            
##  Mean   : 202789   Mean   :13720                                            
##  3rd Qu.:   2322   3rd Qu.:20579                                            
##  Max.   :9999999   Max.   :27438                                            
##                                                                             
##       age                      religion      marital_status 
##  Min.   :15.00   Christians        :15412   Divorced: 2243  
##  1st Qu.:37.00   Muslims           :  392   Married :14673  
##  Median :53.00   Not religious     : 5803   Other   :  107  
##  Mean   :51.56   Orthodox Chrsitian: 4016   Single  : 7635  
##  3rd Qu.:66.00   Other religions   : 1332   Widowed : 2712  
##  Max.   :98.00   NA's              :  483   NA's    :   68  
##                                                             
##        personal_satis     ideology         contact_lgbti  
##  Not satisfied: 4738   Min.   : 1.000   Contact   :11422  
##  Satisfied    :22598   1st Qu.: 4.000   No contact:15522  
##  NA's         :  102   Median : 5.000   NA's      :  494  
##                        Mean   : 5.309                     
##                        3rd Qu.: 7.000                     
##                        Max.   :10.000                     
##                        NA's   :4689                       
##         social_class           occupation       rain_ind     self_determination
##  High class   : 2046   Employed     :12330   Min.   :17.50   Length:27438      
##  Middle class :13068   Retired      : 8791   1st Qu.:29.51   Class :character  
##  Working class:11303   Self employed: 1979   Median :50.07   Mode  :character  
##  NA's         : 1021   Student      : 1676   Mean   :50.08                     
##                        Unemployed   : 2662   3rd Qu.:67.14                     
##                                              Max.   :87.84                     
##                                                                                
##       gini           gdp_pc       religiosity_percent
##  Min.   :24.10   Min.   :  9820   Min.   :0.2149     
##  1st Qu.:28.30   1st Qu.: 18686   1st Qu.:0.6886     
##  Median :31.30   Median : 28570   Median :0.8116     
##  Mean   :30.82   Mean   : 34549   Mean   :0.7800     
##  3rd Qu.:33.90   3rd Qu.: 45589   3rd Qu.:0.9194     
##  Max.   :39.00   Max.   :106343   Max.   :1.0000     
## 

The dataset includes both individual-level variables, which capture personal characteristics and opinions, and country-level contextual variables, which help explain cross-country differences in support for transgender rights.

Individual-Level Variables:

These variables reflect personal attributes, experiences, and attitudes:

  • serialid: Unique identifier for each participant.

  • isocntry: Categorical variable distinguishing respondents by country (28 EU member states).

  • trans_doc: The key variable of this study, measuring attitudes toward the right of transgender individuals to officially change their name and gender. It is binary, with values of 0 (opposition) and 1 (support). In this dataset, 59.9% of participants expressed support.

  • female: 54.5% of the sample are women (14,946), while 45.5% are men (12,492).

  • age: Ranges from 15 to 98 years, with a median age of 53. The youngest 25% of respondents are aged 37 or younger, while the oldest 25% are 66 or older. This indicates a relatively older sample, which is relevant for analyzing attitudes, as previous studies have highlighted generational differences in the acceptance of LGBTI rights.

  • religion: captures respondents’ religious affiliations, showing that a majority identify as Christian, with a smaller proportion reporting no religious affiliation.

  • marital_status: reflects participants’ relationship status, which may provide insight into how family experiences shape attitudes toward the transgender community.

  • personal_satis: Categorical variable representing personal life satisfaction, which could be linked to more open or restrictive attitudes.

  • ideology: A key political scale for analyzing ideological influence. Measured on a 1 to 10 scale, with a median of 5 and a mean of 5.3, suggesting a distribution centered around moderate positions.

  • contact_lgbti: assesses prior exposure to LGBTI individuals, an important factor in reducing prejudice, with a division between those who have had contact and those who have not.

  • social_class: Self-perceived social class, indicating the respondent’s socioeconomic context.

  • occupation: Employment status, allowing us to explore the influence of workplace environments.

Country-Level Contextual Variables:

In addition to individual-level factors, we incorporate aggregated national indicators to better understand how structural and policy differences influence support for transgender rights.

  • rain_ind (Rainbow Index): Ranges from 17.5 %to 87.84%, with a mean of 50.08%. This index reflects the variation in LGBTI rights and protections across countries.

  • self_determination: Legal existence of gender self-determination. A qualitative variable that could influence support for transgender rights.

  • gini: Measures economic inequality at the national level and helps assess its impact on attitudes. Ranges from 24.1% to 39%, with a mean of 30.8%.

  • gdp_pc (GDP per capita): Examines the relationship between economic development and attitudes.Ranges from $9,820 to $106,343, with a mean of $34,577. The units are constant prices of 2015 in dollars to allow us to account for the inflation.

  • religiosity_percent: Percentage of the population that identifies as religious, which is crucial for exploring cultural and religious influences on attitudes.Values range from 21.49% to 100%, with a median of 81.16%.

corr_data <- data_descriptive |> 
  select(trans_doc, age, ideology, rain_ind, gini, gdp_pc, religiosity_percent) |> 
  mutate(trans_doc = as.numeric(trans_doc))

corr_matrix <- cor(corr_data, use = "complete.obs")

plot_correlation(corr_matrix)

This correlation matrix highlights that support for transgender rights is significantly shaped by economic, cultural, and policy-related factors. Higher economic development, lower inequality, and strong LGBTI legal protections are associated with greater acceptance, while religiosity, conservatism, and age tend to correlate with lower support.

Distribution of the individual variables

First of all, we proceed to analyze the distribution of the individual variables in our dataset:

# Religion
g1 <- ggplot(data_descriptive, aes(x = religion)) +
  geom_bar(aes(y = ..prop.., group = 1), fill = "#9C9EDE", color = "black") +
  scale_y_continuous(labels = percent_format()) +
  labs(title = "Religious identification", x = "Religion", y = "Percentage") +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    axis.title = element_text(size = 12, face = "bold")) +
  theme_minimal()

# Marital status
g2 <- ggplot(data_descriptive, aes(x = marital_status)) +
  geom_bar(aes(y = ..prop.., group = 1), fill = "#98DF8A", color = "black") +
  scale_y_continuous(labels = percent_format()) +
  labs(title = "Marital status distribution", x = "Status", y = "Percentage") +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    axis.title = element_text(size = 12, face = "bold")) +
  theme_minimal()

# Personal satisfaction
g3 <- ggplot(data_descriptive, aes(x = personal_satis)) +
  geom_bar(aes(y = ..prop.., group = 1), fill = "#FF9896", color = "black") +
  scale_y_continuous(labels = percent_format()) +
  labs(title = "Personal Satisfaction", 
       x = "Response", y = "Percentage") +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    axis.title = element_text(size = 12, face = "bold")) +
  theme_minimal()

# Social class
g4 <- ggplot(data_descriptive, aes(x = social_class)) +
  geom_bar(aes(y = ..prop.., group = 1), fill = "#9EDAE5", color = "black") +
  scale_y_continuous(labels = percent_format()) +
  labs(title = "Self-reported social slass", x = "Social class", y = "Percentage") +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    axis.title = element_text(size = 12, face = "bold")) +
  theme_minimal()

#Occupation

g5 <- ggplot(data_descriptive, aes(x = occupation)) +
  geom_bar(aes(y = ..prop.., group = 1), fill = "yellow", color = "black") +
  scale_y_continuous(labels = percent_format()) +
  labs(title = "Occupation distribution", x = "Occupation", y = "Percentage") +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    axis.title = element_text(size = 12, face = "bold")) +
  theme_minimal()

# Contact with LGTB community

g6 <- ggplot(data_descriptive, aes(x = contact_lgbti)) +
  geom_bar(aes(y = ..prop.., group = 1), fill = "darkblue", color = "black") +
  scale_y_continuous(labels = percent_format()) +
  labs(title = "Contact with the LGTBI community", x = "Response", y = "Percentage") +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    axis.title = element_text(size = 12, face = "bold")) +
  theme_minimal()

# Graphs combined
g <- (g1 | g2 | g3) / (g4 | g5 | g6) 
g
## Warning: The dot-dot notation (`..prop..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(prop)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

One of the first aspects to analyze is the distribution of categorical variables in the dataset, which, as previously noted, account for 55.6% of the total data. The set of bar charts provides an overview of key individual characteristics, highlighting patterns in religious identification, marital status, personal satisfaction, social class, occupation, and contact with the LGBTI community.

The majority of respondents identify as Christian, while a notable percentage report no religious affiliation, and smaller proportions belong to other religious groups such as Muslims and Orthodox Christians. In terms of marital status, the most represented group consists of married individuals, followed by single respondents, while divorced and widowed participants form smaller segments of the sample.

Regarding personal satisfaction, a significant majority report being satisfied with their lives, whereas only a small minority express dissatisfaction. The self-reported social class distribution indicates that most respondents identify as middle or working class, while a much smaller portion considers themselves part of the high class. Similarly, the occupation distribution reveals that most individuals are either employed or retired, while students, the self-employed, and the unemployed represent smaller proportions.

Finally, contact with the LGBTI community reveals a notable divide, with a substantial portion of respondents reporting no contact with LGBTI individuals, while a significant number indicate prior interaction. This variable is particularly relevant, as previous research suggests that personal exposure to LGBTI individuals can influence attitudes toward their rights.

ggplot(data_descriptive, aes(x = age)) +
  geom_histogram(aes(y = ..density..), bins = 30, fill = "steelblue", alpha = 0.6) +
  geom_density(color = "red", linewidth = 1) +
  theme_minimal() +
  labs(title = "Age Distribution", x = "Age", y = "Density")

Next, we observe the age distribution of the respondents in the dataset. The distribution appears right-skewed, indicating that there are fewer younger participants and a greater proportion of middle-aged and older respondents. The density increases gradually, peaking around 50 to 70 years old, before sharply declining beyond 75 years.The smoother density curve (red line) confirms the overall trend, highlighting multiple local peaks, particularly around 30, 50, and 70 years old, before declining toward the upper age limit of the sample.

Overall, the distribution supports the earlier summary statistics, where the median age was reported at 53 years, with the youngest quartile at 37 years or younger and the oldest quartile at 66 years or older. The relatively low representation of younger individuals could influence the overall attitudinal patterns observed in the study.

Variable Distribution by Country

Next, we analyze the contextual differences between the countries surveyed in the dataset.

# Contact with LGTBI community by country
country_order <- data_descriptive %>%
  filter(!is.na(contact_lgbti)) %>%
  group_by(isocntry) %>%
  summarise(no_contact_prop = mean(contact_lgbti == "No contact")) %>%
  arrange(desc(no_contact_prop))


ggplot(data_descriptive, aes(x = factor(isocntry, levels = country_order$isocntry), fill = contact_lgbti)) +
  geom_bar(position = "fill") +
  theme_minimal() +
  labs(title = "Contact with LGTBI community by country", 
       x = "Country", y = "Percentage") +
  scale_fill_manual(values = c("Contact" = "olivedrab3", "No contact" = "red2")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

data_country <- data_descriptive |> 
  filter(!is.na(contact_lgbti)) |> 
  group_by(isocntry) |> 
  summarise(no_contact_prop = mean(contact_lgbti == "No contact")) |> 
  mutate(contact_category = ifelse(no_contact_prop < 0.5, "Majority Contact", "Majority No Contact"))

map_data <- joinCountryData2Map(data_country, joinCode = "ISO2", nameJoinColumn = "isocntry")
## 28 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 215 codes from the map weren't represented in your data
color_palette <- c("Majority Contact" = "olivedrab3", "Majority No Contact" = "red2")

mapCountryData(map_data, 
               nameColumnToPlot = "contact_category", 
               mapTitle = "Predominance of Contact with the LGBTI Community by Country",
               catMethod = "categorical",
               colourPalette = color_palette,
               mapRegion = "Europe")

These two visualizations illustrate the distribution of contact with the LGBTI community across countries, highlighting significant regional differences in social exposure to LGBTI individuals.

The first bar chart presents the proportion of respondents in each country who report having contact with LGBTI individuals (green) versus those who do not (red), with a small portion of missing responses (gray). Countries are arranged from left to right based on decreasing levels of respondents lacking contact with the LGBTI community. Romania, Bulgaria, Poland, and several other Eastern European countries show the highest levels of non-contact, while the Netherlands, Sweden, and other Western and Northern European nations report the highest levels of contact. This suggests a strong regional divide in familiarity with LGBTI individuals.

The second map visually reinforces this divide, categorizing countries based on whether a majority of respondents reported having contact with the LGBTI community. Western and Northern European countries (in green) generally show higher exposure with the exception of Portugal, while Eastern and Southern European countries (in red) report lower levels of contact.

western_northern <- c("AT", "BE", "DK", "FI", "FR", "DE", "IE", "IS", "LU", "NL", "NO", "SE", "CH", "ES", "PT", "IT", "GB", "MT")
eastern <- c("BG", "CZ", "EE", "HU", "LT", "LV", "PL", "RO", "RU", "SK", "SI", "HR", "UA", "GR", "CY")

data_descriptive$region <- case_when(
  data_descriptive$isocntry %in% western_northern ~ "Western/Northern",
  data_descriptive$isocntry %in% eastern ~ "Eastern",
  TRUE ~ "Other"
)

ggplot(data_descriptive, aes(x = gdp_pc, y = rain_ind, label = isocntry)) +
  geom_point(aes(color = region), size = 4, shape = 21, stroke = 1, fill = "white") +  
  geom_text(size = 3, hjust = 1.6)  + 
  theme_minimal() +
  labs(
    title = "GDP per capita vs Rainbow Indicator per Country",
    x = "GDP per capita",
    y = "Rainbow Indicator",
    color = "Region"
  ) +
  theme(
    plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
    legend.position = "top")

ggplot(data_descriptive, aes(x = religiosity_percent, y = rain_ind, label = isocntry)) +
  geom_point(aes(color = case_when(
    isocntry %in% western_northern ~ "Western/Northern",
    isocntry %in% eastern ~ "Eastern",
    TRUE ~ "Other")), size = 4, shape = 21, stroke = 1, fill = "white") +  
  geom_text(size = 3, hjust = -0.6) +  
  scale_color_manual(values = c("Western/Northern" = "royalblue", "Eastern" = "red3", "Other" = "gray")) + 
  theme_minimal(base_size = 14) +
  labs(
    title = "Religiosity Percentage vs Rainbow Indicator per Country",
    x = "Religiosity Percentage",
    y = "Rainbow Indicator",
    color = "Region"
  ) +
  theme(
    plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
    legend.position = "top")

Both graphs explore the relationship between the Rainbow Indicator, which measures LGBTQ+ rights and inclusivity, and two different country lavel factors: GDP per capita in the first graph and Religiosity Percentage in the second. Countries are classified by region, with red representing Eastern Europe and light blue representing Western/Northern Europe.

In the first graph (GDP per capita vs Rainbow Indicator), a pattern emerges where countries with higher GDP per capita tend to have higher Rainbow Indicator scores but not very significant, suggesting that wealthier nations generally provide greater LGBTQ+ rights and protections. Western and Northern European countries, which are typically wealthier, cluster toward the upper right, indicating both strong economies and progressive LGBTQ+ policies. Conversely, many Eastern European countries, which have lower GDP per capita, also tend to have lower Rainbow Indicator scores, reflecting more restrictive policies on LGBTQ+ rights. However, there are exceptions, such as Greece (GR), which, despite being in Eastern Europe, appears to have a relatively progressive stance on LGBTQ+ rights compared to its neighbors or Italy, which despite being in Occident Europe has a surprisingly low Rainbow Indicator.

In the second graph, which plots Religiosity Percentage against the Rainbow Indicator, the relationship appears weaker and less clearly defined compared to the first graph. Two of the most notable outliers are Czechia (CZ), which has one of the lowest levels of religiosity in Eastern Europe yet does not rank particularly high on the Rainbow Indicator. This suggests that while secularization may play a role in advancing LGBTQ+ rights, it is not the only determining factor.

# Distribution of the Political Ideology by Country

ggplot(data_descriptive, aes(x = reorder(isocntry, ideology, FUN = median, na.rm = TRUE), y = ideology, fill = isocntry)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Distribution of the Political Ideology by Country", 
       x = "Country", y = "Ideology (1 = Left, 10 = Right)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") +  
   scale_fill_manual(values = colorRampPalette(brewer.pal(9, "Blues"))(28))
## Warning: Removed 4689 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

This boxplot visualization illustrates the distribution of political ideology across countries, measured on a 1 to 10 scale, where 1 represents left-wing views and 10 represents right-wing views. The countries are arranged from left to right based on their median ideological positioning, with lighter shades representing more left-leaning countries and darker shades indicating more right-leaning ones.

Almost every Western and Northern European countries, such as Austria (AT), Belgium (BE), Denmark (DK), and Finland (FI), show a more center median ideology, with a broader distribution of responses. In contrast, Eastern and Southern European countries, including Slovakia (SK), Hungary (HU), Latvia (LV), and Romania (RO), display higher median scores, indicating a more right-leaning population.

However, Spain is particularly interesting because, although its median ideology is similar to that of many other countries, the overall distribution is skewed to the left, meaning a larger proportion of respondents identify as left-wing compared to right-wing. This suggests that while the central tendency remains balanced, there is a stronger leftist presence in the country.

In contrast, some countries like the Czech Republic (CZ), Cyprus (CY), and Estonia (EE) exhibit the opposite trend. Their median scores remain relatively moderate, yet the distribution skews towards the right, implying a larger proportion of right-leaning individuals compared to left-leaning ones.

Bivariate Analysis with Target Variable

data_country <- aggregate(as.numeric(trans_doc) ~ isocntry, data = data_descriptive, mean, na.rm = TRUE)

map_data <- joinCountryData2Map(data_country, joinCode = "ISO2", nameJoinColumn = "isocntry")
## 28 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 215 codes from the map weren't represented in your data
mapCountryData(map_data, 
               nameColumnToPlot = "as.numeric(trans_doc)", 
               mapTitle = "Support for change of Documents of the Trans Community per Country", 
               catMethod = "quantiles", 
               colourPalette = brewer.pal(9, "Blues"), 
               mapRegion = "Europe")
## Warning in rwmGetColours(colourPalette, numColours): 9 colours specified and 7
## required, using interpolation to calculate colours

This choropleth map illustrates the level of support for transgender individuals’ right to change their official documents, our target variable, across different European countries. The color gradient, ranging from light blue (low support) to dark blue (high support), visually represents the variation in attitudes toward this issue.

Western and Northern European countries, such as Spain, Portugal, the Netherlands, Belgium, and the Nordic countries, exhibit the highest levels of support, as indicated by the darkest shades of blue. This pattern aligns with broader trends of progressive LGBTI policies and strong legal protections in these regions. France and Germany show similarly high levels of support, though not as strong.

In contrast, Central and Eastern European countries, including Poland, Hungary, and several Balkan states, show significantly lower levels of support, as reflected by the lighter shades. These countries tend to have more conservative social attitudes and restrictive legal frameworks regarding transgender rights. Italy stands out with surprisingly low support compared to other Western European nations. This suggests that despite being a major Western European country, Italy remains more conservative on transgender issues.

ggplot(data_descriptive, aes(x = as.factor(trans_doc), y = ideology, fill = as.factor(trans_doc))) + 
  geom_boxplot(alpha = 0.6) + 
  theme_minimal() + 
  labs(title = "Political Ideology vs Support Trans Documents", 
       x = "Support for the change of Documents of the Trans Community", 
       y = "Ideology (1 = Left, 10 = Right)") + 
  scale_fill_manual(values = c("0" = "red2", "1" = "olivedrab3")) + 
  scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) + 
  theme(legend.position = "none")
## Warning: Removed 4689 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

ggplot(data_descriptive, aes(x = as.factor(trans_doc), y = religiosity_percent, fill = as.factor(trans_doc))) +
  geom_boxplot(alpha = 0.6) +
  theme_minimal() +
  labs(title = "Religiosity vs Support Trans Documents", 
       x = "Support for the change of Documents of the Trans Community", 
       y = "Percentage of Religiosity") + 
  scale_fill_manual(values = c("0" = "red2", "1" = "olivedrab3")) + 
  scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) + 
  theme(legend.position = "none")

These two boxplots examine the relationship between support for transgender individuals’ right to change their official documents and two factors: political ideology and religiosity.

In the first boxplot (Political Ideology vs. Support for Trans Documents), the median ideology score is similar across groups, meaning that those who support (green) and oppose (red) document changes have roughly the same central tendency. However, a key difference emerges in the distribution of responses: those who oppose the right to change documents have a longer right-leaning tail, indicating a higher presence of strongly right-wing individuals in this group. In contrast, supporters have a more centralized distribution, meaning they are not necessarily strongly left-wing, but rather clustered around moderate-left to centrist views. The NA category (gray) also has a similar median but a broader spread, covering both left- and right-leaning respondents more evenly.

The second boxplot (Religiosity vs. Support for Trans Documents) presents a slightly different pattern. Here, the median religiosity score is slightly higher for those who oppose document changes compared to supporters. However, unlike political ideology, the distribution of religiosity is more similar across groups, meaning that while opposition to trans rights is somewhat more frequent among more religious individuals, this is not highly significant

In summary, while there are some observable differences in political ideology and religiosity between those who support or oppose transgender rights, they are not highly significant. The distributions indicate that attitudes toward transgender document changes are influenced by multiple factors, with ideology and religiosity playing a role but not being the sole determinants.

Mice imputation

Before continuing to the models we take a look at missing values in the original Eurobarometer dataset and the possibility to impute them. We did the descriptive analysis without previous imputation as we perceived that descriptive makes only sense with the original dataset and the real values we know to be true.

First, we take a look at the NA distribution of the Eurobarometer dataset.

colSums(is.na(individual_data))
##         caseid       serialid       isocntry      trans_doc         female 
##              0              0              0           3280              0 
##            age       religion marital_status personal_satis       ideology 
##              0            483             68            102           4689 
##  contact_lgbti   social_class     occupation 
##            494           1021              0

The variables with most missing values are ideology, with 4689 NAs, which supposes a 17% of missing data within the variable (5689/27438), and social_class, with 1021 NAs, making up 21% of the variable (5883/27438). Religion has 483 NAs and contact_lgbti has 494 NAs. Trans_doc has 3280 NAs, but we will not impute it as it is our target variable. The variables marital_status and personal_satis have 68 and 102 NAs respectively. Due to the low number we can go ahead and simply remove the rows. We remove caseid and serialid as it could confuse our mice imputation.

mice_data <- individual_data %>% 
  drop_na(c(marital_status, personal_satis)) %>% 
  select(-c(caseid))

serialid <- mice_data$serialid

mice_data <- mice_data %>% 
  select(-c(serialid))

colSums(is.na(mice_data))
##       isocntry      trans_doc         female            age       religion 
##              0           3214              0              0            448 
## marital_status personal_satis       ideology  contact_lgbti   social_class 
##              0              0           4599            472            973 
##     occupation 
##              0

We are left with 4599 NAs for ideology, 472 NAs for contact_lgbti, 448 NAs for religion and 973 NAs for social class.

First, we are going to manually look at the different imputation methods and strategies to understand how close they come to the original dataset. As there are factor and numerical variables, the best options to test tje imputation are Random Forest and Pmm methods.

imputed_comparison <- data.frame(
  original = mice_data$ideology,
  imputed_rf = complete(mice(mice_data, method = "rf", 
                             m = 3, maxit = 3, 
                             seed = 123))$ideology,
  imputed_pmm = complete(mice(mice_data, method = "pmm",
                              m = 3, maxit = 3, 
                              seed = 123))$ideology)
## 
##  iter imp variable
##   1   1  trans_doc  religion  ideology  contact_lgbti  social_class
##   1   2  trans_doc  religion  ideology  contact_lgbti  social_class
##   1   3  trans_doc  religion  ideology  contact_lgbti  social_class
##   2   1  trans_doc  religion  ideology  contact_lgbti  social_class
##   2   2  trans_doc  religion  ideology  contact_lgbti  social_class
##   2   3  trans_doc  religion  ideology  contact_lgbti  social_class
##   3   1  trans_doc  religion  ideology  contact_lgbti  social_class
##   3   2  trans_doc  religion  ideology  contact_lgbti  social_class
##   3   3  trans_doc  religion  ideology  contact_lgbti  social_class
## 
##  iter imp variable
##   1   1  trans_doc  religion  ideology  contact_lgbti  social_class
##   1   2  trans_doc  religion  ideology  contact_lgbti  social_class
##   1   3  trans_doc  religion  ideology  contact_lgbti  social_class
##   2   1  trans_doc  religion  ideology  contact_lgbti  social_class
##   2   2  trans_doc  religion  ideology  contact_lgbti  social_class
##   2   3  trans_doc  religion  ideology  contact_lgbti  social_class
##   3   1  trans_doc  religion  ideology  contact_lgbti  social_class
##   3   2  trans_doc  religion  ideology  contact_lgbti  social_class
##   3   3  trans_doc  religion  ideology  contact_lgbti  social_class

The m and the maxit arguments are = 3 to reduce loading time without losing so much information.

The next step is to plot the results to compare the resulting distributions:

# Arguments for the plot 
methods <- c("original", "imputed_rf" , "imputed_pmm")
titles <- c("Distribution of the Age Variable",
           "Random Forest-imputed Distribution",
            "PMM-imputed Distribution")
colors_fill <- c( "#E7CB94", "#E7969C", "#DE9ED6")


# Long format 
data_imputed_long <- imputed_comparison |>
  pivot_longer(cols = all_of(methods), names_to = "method", 
                values_to = "value") |>
  mutate(title = factor(method, levels = methods, labels = titles))


# Distributions comparison 
plot_mice <- ggplot(data_imputed_long, aes(x = value, fill = title)) +
  geom_histogram(binwidth = 1, color = "black", position = "identity") +
  facet_wrap(~ title, scales = "free_y") +
  scale_fill_manual(values = colors_fill) +
  theme_classic() +
  theme(legend.position = "none")

plot_mice
## Warning: Removed 4599 rows containing non-finite outside the scale range
## (`stat_bin()`).

As can be seen, both methods present very similar distributions for the ideology variable compared to the original one. This means that both methods predict values correctly for the ideological location variable.

The next step will be to perform this imputation for the rest of the dataset. We will use m = 3, as our missing values range between 5% and 20% of the observations. Hence, m=3 is still powerful enough to have reasonable imputation and it will reduce the loading time.

init = mice(mice_data, m = 3, seed = 123)
## 
##  iter imp variable
##   1   1  trans_doc  religion  ideology  contact_lgbti  social_class
##   1   2  trans_doc  religion  ideology  contact_lgbti  social_class
##   1   3  trans_doc  religion  ideology  contact_lgbti  social_class
##   2   1  trans_doc  religion  ideology  contact_lgbti  social_class
##   2   2  trans_doc  religion  ideology  contact_lgbti  social_class
##   2   3  trans_doc  religion  ideology  contact_lgbti  social_class
##   3   1  trans_doc  religion  ideology  contact_lgbti  social_class
##   3   2  trans_doc  religion  ideology  contact_lgbti  social_class
##   3   3  trans_doc  religion  ideology  contact_lgbti  social_class
##   4   1  trans_doc  religion  ideology  contact_lgbti  social_class
##   4   2  trans_doc  religion  ideology  contact_lgbti  social_class
##   4   3  trans_doc  religion  ideology  contact_lgbti  social_class
##   5   1  trans_doc  religion  ideology  contact_lgbti  social_class
##   5   2  trans_doc  religion  ideology  contact_lgbti  social_class
##   5   3  trans_doc  religion  ideology  contact_lgbti  social_class

For our target variable trans_doc we set method=““, so that it is not imputed.

meth = init$method

meth
##       isocntry      trans_doc         female            age       religion 
##             ""       "logreg"             ""             ""      "polyreg" 
## marital_status personal_satis       ideology  contact_lgbti   social_class 
##             ""             ""          "pmm"       "logreg"      "polyreg" 
##     occupation 
##             ""
meth[c("trans_doc")]=""

imputed_data = mice(mice_data, method=meth, m=3, seed=123)
## 
##  iter imp variable
##   1   1  religion  ideology  contact_lgbti  social_class
##   1   2  religion  ideology  contact_lgbti  social_class
##   1   3  religion  ideology  contact_lgbti  social_class
##   2   1  religion  ideology  contact_lgbti  social_class
##   2   2  religion  ideology  contact_lgbti  social_class
##   2   3  religion  ideology  contact_lgbti  social_class
##   3   1  religion  ideology  contact_lgbti  social_class
##   3   2  religion  ideology  contact_lgbti  social_class
##   3   3  religion  ideology  contact_lgbti  social_class
##   4   1  religion  ideology  contact_lgbti  social_class
##   4   2  religion  ideology  contact_lgbti  social_class
##   4   3  religion  ideology  contact_lgbti  social_class
##   5   1  religion  ideology  contact_lgbti  social_class
##   5   2  religion  ideology  contact_lgbti  social_class
##   5   3  religion  ideology  contact_lgbti  social_class
summary(imputed_data)
## Class: mids
## Number of multiple imputations:  3 
## Imputation methods:
##       isocntry      trans_doc         female            age       religion 
##             ""             ""             ""             ""      "polyreg" 
## marital_status personal_satis       ideology  contact_lgbti   social_class 
##             ""             ""          "pmm"       "logreg"      "polyreg" 
##     occupation 
##             "" 
## PredictorMatrix:
##                isocntry trans_doc female age religion marital_status
## isocntry              0         1      1   1        1              1
## trans_doc             1         0      1   1        1              1
## female                1         1      0   1        1              1
## age                   1         1      1   0        1              1
## religion              1         1      1   1        0              1
## marital_status        1         1      1   1        1              0
##                personal_satis ideology contact_lgbti social_class occupation
## isocntry                    1        1             1            1          1
## trans_doc                   1        1             1            1          1
## female                      1        1             1            1          1
## age                         1        1             1            1          1
## religion                    1        1             1            1          1
## marital_status              1        1             1            1          1

Merging datasets again

For modeling, the best would be to use the function with() and all 3 generated datasets of mice, which would give us a better result but also take more time. Due to practicality and the point of the exercise not being focused on perfect mice usage, we take the decision of extracting and keep working with only one of the datasets.

imputed_data_complete <- complete(imputed_data)

write.csv(imputed_data_complete, "imputed_data_complete.csv")
data <- imputed_data_complete |> 
  left_join(rainbow, by = c("isocntry" = "country_iso")) |> 
  left_join(trans, by = c("isocntry" = "country_iso")) |> 
  left_join(economic_indicators, by = c("isocntry" = "country_iso")) |> 
  left_join(religiosity, by = c("isocntry" = "country_iso"))

Explaining Cross-Country Differences

First of all, we performed a brief data preparation to enhance stability and model performance. First, we transformed GDP per capita with a natural logarithm to reduce the asymmetry of its distribution and make nonlinear relationships with other variables more manageable. Then, we reassigned the reference level of the categorical variable on contact with LGBTI people, setting “No contact” as the base category.

data <- data |> 
  mutate(gdp_pc = log(gdp_pc))

data$contact_lgbti <- relevel(data$contact_lgbti, ref = "No contact")

str(data)
## 'data.frame':    27271 obs. of  16 variables:
##  $ isocntry           : chr  "BE" "BE" "BE" "BE" ...
##  $ trans_doc          : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 1 2 2 ...
##  $ female             : Factor w/ 2 levels "0","1": 1 2 2 1 2 2 1 1 2 2 ...
##  $ age                : num  51 62 38 29 63 41 48 88 44 45 ...
##  $ religion           : Factor w/ 5 levels "Christians","Muslims",..: 1 1 3 3 1 3 3 1 1 1 ...
##  $ marital_status     : Factor w/ 5 levels "Divorced","Married",..: 1 2 2 2 2 2 2 1 2 2 ...
##  $ personal_satis     : Factor w/ 2 levels "Not satisfied",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ ideology           : num  7 8 4 10 8 7 3 9 3 8 ...
##  $ contact_lgbti      : Factor w/ 2 levels "No contact","Contact": 1 1 2 2 1 1 1 1 2 2 ...
##  $ social_class       : Factor w/ 3 levels "High class","Middle class",..: 3 2 3 1 1 2 2 2 2 1 ...
##  $ occupation         : Factor w/ 5 levels "Employed","Retired",..: 1 5 1 1 5 1 1 2 5 1 ...
##  $ rain_ind           : num  78.5 78.5 78.5 78.5 78.5 ...
##  $ self_determination : chr  "1" "1" "1" "1" ...
##  $ gini               : num  26.6 26.6 26.6 26.6 26.6 26.6 26.6 26.6 26.6 26.6 ...
##  $ gdp_pc             : num  10.7 10.7 10.7 10.7 10.7 ...
##  $ religiosity_percent: num  0.689 0.689 0.689 0.689 0.689 ...

Then, we check for the multicollinearity in our model:

model <- glm(trans_doc ~ scale(age) + 
               female + 
               religion + 
               occupation + 
               marital_status + 
               personal_satis +
               contact_lgbti +
               scale(ideology) +
               social_class +
               scale(gdp_pc) +
               self_determination + 
               scale(gini) + 
               scale(rain_ind) + 
               scale(religiosity_percent), 
             family = "binomial", data = data)

summary(model)

exp(coef(model))

vif(model) 

We use the Variance Inflation Factor (VIF) to identify highly correlated predictor variables. In the results, the GVIF values indicate that the multicollinearity is not problematic, as most VIFs are below 2. This suggests there is no significant multicollinearity between the individual variables and collective variables, ensuring reliable coefficient estimates and proper interpretation of each variable’s effect.

In every model, we incorporate both age and its squared term (age^2) to compare differences. The rationale behind this is that the linear term (age) and the quadratic term (age^2) work together to model a curvilinear relationship. Even though age itself may not be significant, we retain it in the model because the combined effect of both terms allows for capturing a potential non-linear relationship with the target variable. This helps to better model the relationship between age and support for transgender rights.

Next, we test different models to identify key factors influencing support for transgender individuals’ right to decide on their own document.

model2 <- glmer(trans_doc ~ 
                  scale(age) + 
                  female + 
                  religion +
                  occupation +
                  marital_status + 
                  personal_satis +
                  contact_lgbti +
                  scale(ideology) +
                  social_class +
                  (1|isocntry), 
                family = "binomial", 
                data = data)
summary(model2)


model3 <- glmer(trans_doc ~ 
                  scale(age) + 
                  I(scale(age)^2) +
                  female + 
                  occupation +
                  religion*scale(religiosity_percent) + 
                  marital_status*scale(religiosity_percent) + 
                  personal_satis*self_determination +
                  contact_lgbti*scale(rain_ind) +
                  contact_lgbti*self_determination +
                  scale(ideology)*scale(gdp_pc) +
                  social_class*scale(gini) +
                  (1|isocntry), 
                family = "binomial", 
                data = data)
summary(model3)

We started with model2, a generalized linear mixed-effects model that includes a random intercept for each country to account for country-specific variation. Then, we tried model3, a more complex model that includes interaction terms (e.g., religion and religiosity, marital status and religiosity) to capture more nuanced relationships. Some variables like age, self-determination, and GDP per capita were not significant on their own, but interactions like religion and marital status showed relevance.

In model4, we extended the complexity further by adding additional interaction terms and quadratic terms, similar to model3. This model includes interactions like religion and religiosity percent, personal satisfaction and self-determination, and contact with LGBTI people with variables such as rain index and self-determination. We also included the interaction between ideology and GDP per capita and social class with the Gini index.

model4 <- glmer(trans_doc ~ 
                  scale(age) + 
                  I(scale(age)^2) + 
                  female + 
                  religion +
                  occupation + 
                  religion:scale(religiosity_percent) + 
                  marital_status + 
                  personal_satis + 
                  personal_satis:self_determination +
                  contact_lgbti*scale(rain_ind) +
                  contact_lgbti:self_determination +
                  scale(ideology) +
                  scale(ideology):scale(gdp_pc) +
                  social_class*scale(gini) +
                  (1|isocntry), 
                family = "binomial", 
                data = data)
summary(model4)

However, after reviewing the results, we found that certain variables, such as self-determination, personal satisfaction, Gini, and social class, were not significant.

In model5, we included additional variables and interactions, while also refining the previous interactions. This model examines the impact of variables like age, gender, religion, occupation, and their interactions, such as religion with religiosity percentage, contact with LGBTI people with rain index, and ideology with GDP per capita.

model5 <- glmer(trans_doc ~ 
                  scale(age) +
                  I(scale(age) ^ 2) + 
                  female + 
                  occupation + 
                  religion +
                  religion:scale(religiosity_percent) + 
                  marital_status + 
                  personal_satis +
                  contact_lgbti*scale(rain_ind) +
                  contact_lgbti:self_determination +
                  ideology + 
                  ideology:scale(gdp_pc) +
                  social_class +
                  scale(gini) +
                  (1|isocntry), 
                family = "binomial", 
                data = data)
summary(model5)

Similar to the previous results, we found that Gini, contact with LGBTI people, self-determination, and social class were not significant.

In model6, we further refined the model by removing non-significant variables from previous iterations, resulting in a more streamlined approach

model6 <- glmer(trans_doc ~ 
                  scale(age) +
                  I(scale(age) ^ 2) + 
                  female + 
                  occupation +
                  religion +
                  religion:scale(religiosity_percent) + 
                  marital_status + 
                  personal_satis +
                  contact_lgbti*scale(rain_ind) +
                  ideology + 
                  ideology:scale(gdp_pc) +
                  (1|isocntry), 
                family = "binomial",
                data = data)
summary(model6)

After evaluating the results, we found that marital status was not significant in predicting support for transgender rights.

In model7, we further simplified the model by removing marital status based on the previous findings.

model7 <- glmer(trans_doc ~ 
                  scale(age) +
                  I(scale(age) ^ 2) + 
                  female + 
                  occupation +
                  religion +
                  religion:scale(religiosity_percent) + 
                  personal_satis +
                  contact_lgbti*scale(rain_ind) +
                  ideology + 
                  ideology:scale(gdp_pc) +
                  (1|isocntry), 
                family = "binomial", 
                data = data)
summary(model7)

In model8, we introduced random slopes alongside random intercepts. Specifically, we modeled age, gender (female), and ideology as having random slopes by country. The random intercepts allow for country-specific baselines, and the random slopes allow us to capture how the relationships between these predictors and the target variable may differ between countries.

model8 <- glmer(trans_doc ~ 
                  scale(age) +
                  I(age ^ 2) + 
                  female + 
                  occupation +
                  religion +
                  religion:scale(religiosity_percent) + 
                  personal_satis +
                  contact_lgbti*scale(rain_ind) +
                  ideology + 
                  ideology:scale(gdp_pc) +
                  (1 + scale(age) + female + scale(ideology)|isocntry), 
                family = "binomial", 
                data = data)
summary(model8)

ranef(model8) |>  # This is for extracting random intercepts and slopes
  str()

The random effects output shows how intercepts and slopes vary by country. The (Intercept) column represents country-specific deviations from the overall intercept. The scale(age), female1, and scale(ideology) columns indicate how these predictors’ effects differ across countries. The values suggest small but varying influences of age, gender, and ideology on the target variable, depending on the country.

The next model includes additional predictors like female, occupation, and religion interactions with religiosity_percent, along with other interaction terms such as contact_lgbti with rainbow index and ideology with GDP per capita. We also kept the random slopes for age, female, and ideology across countries, as these allow the effects of these variables to vary by country.

step_model <- glmer(trans_doc ~ 
                  scale(age) + 
                  I(scale(age)^2) +
                  female + 
                  occupation +
                  religion*scale(religiosity_percent) + 
                  marital_status*scale(religiosity_percent) + 
                  personal_satis*self_determination +
                  contact_lgbti*scale(rain_ind) +
                  contact_lgbti*self_determination +
                  scale(ideology)*scale(gdp_pc) +
                  social_class*scale(gini) +
                  (1 + scale(age) + 
                     female +
                     scale(ideology)
                     |isocntry), 
                family = "binomial",
                data = data)

summary(step_model)

The results showed that age (quadratic), gender, occupation, personal satisfaction, contact with LGBTI people, and ideology were relevant predictors. Additionally, interactions between religion and religiosity, as well as ideology and GDP, played a significant role. However, variables like marital status, social class, and the Gini index were not significant, leading to their exclusion in the final model:

final_model <- glmer(trans_doc ~ 
                      scale(age) + 
                      I(scale(age)^2) +
                      female + 
                      occupation +
                      religion*scale(religiosity_percent) +  
                      personal_satis +
                      contact_lgbti*scale(rain_ind) +
                      self_determination +
                      scale(ideology)*scale(gdp_pc) +
                      (1 + scale(age) + 
                         female +
                         scale(ideology)
                       |isocntry), 
                    family = "binomial",
                    data = data)

covariate_labels <- c(
  "Age of respondent", 
  "Age of respondent (squared)", 
  "Gender of respondent (female)", 
  "Occupation of participant: Retired", 
  "Occupation of participant: Self employed", 
  "Occupation of participant: Student", 
  "Occupation of participant: Unemployed",
  "Religious affiliation of participant: Muslims", 
  "Religious affiliation of participant: Not religious", 
  "Religious affiliation of participant: Orthodox Christian", 
  "Religious affiliation of participant: Other religions", 
  "Religiosity of the country's population (%)", 
  "Personal satisfaction of participants: Satisfied", 
  "Contact with LGBTI community of participant: No Contact",  
  "Rainbow Europe Country Indicator", 
  "Law in favor of Gender identity self-determination in a country: Exists", 
  "Ideology scale of participant", 
  "GDP of countries (%) ", 
  "Interaction between Muslim religion of participant and percentage of religiosity in the country", 
  "Interaction between No religion of participant and percentage of religiosity in the country", 
  "Interaction between Orthodox Christian religion of participant and percentage of religiosity in the country",
  "Interaction between Other religion of participant and percentage of religiosity in the country", 
  "Interaction between participant having No contact with LGBTI community and Rainbow Europe Country Indicator",  
  "Interaction between ideology of a participant and GDP (%) of a country"
)


stargazer(final_model, type = "text",
          covariate.labels = covariate_labels,
          title = "Regression Results",
          dep.var.labels = "Dependent Variable: Trans Name",
          star.cutoffs = c(0.05, 0.01, 0.001),
          digits = 3)
## 
## Regression Results
## ==========================================================================================================================================
##                                                                                                                  Dependent variable:      
##                                                                                                             ------------------------------
##                                                                                                             Dependent Variable: Trans Name
## ------------------------------------------------------------------------------------------------------------------------------------------
## Age of respondent                                                                                                       -0.044            
##                                                                                                                        (0.045)            
##                                                                                                                                           
## Age of respondent (squared)                                                                                           -0.075***           
##                                                                                                                        (0.019)            
##                                                                                                                                           
## Gender of respondent (female)                                                                                          0.377***           
##                                                                                                                        (0.045)            
##                                                                                                                                           
## Occupation of participant: Retired                                                                                     -0.164**           
##                                                                                                                        (0.053)            
##                                                                                                                                           
## Occupation of participant: Self employed                                                                                0.137*            
##                                                                                                                        (0.061)            
##                                                                                                                                           
## Occupation of participant: Student                                                                                     0.226**            
##                                                                                                                        (0.084)            
##                                                                                                                                           
## Occupation of participant: Unemployed                                                                                   0.053             
##                                                                                                                        (0.056)            
##                                                                                                                                           
## Religious affiliation of participant: Muslims                                                                         -0.463***           
##                                                                                                                        (0.134)            
##                                                                                                                                           
## Religious affiliation of participant: Not religious                                                                    0.361***           
##                                                                                                                        (0.047)            
##                                                                                                                                           
## Religious affiliation of participant: Orthodox Christian                                                                -0.023            
##                                                                                                                        (0.075)            
##                                                                                                                                           
## Religious affiliation of participant: Other religions                                                                   -0.125            
##                                                                                                                        (0.078)            
##                                                                                                                                           
## Religiosity of the country's population (%)                                                                             -0.190            
##                                                                                                                        (0.097)            
##                                                                                                                                           
## Personal satisfaction of participants: Satisfied                                                                       0.376***           
##                                                                                                                        (0.044)            
##                                                                                                                                           
## Contact with LGBTI community of participant: No Contact                                                                0.903***           
##                                                                                                                        (0.035)            
##                                                                                                                                           
## Rainbow Europe Country Indicator                                                                                        0.248             
##                                                                                                                        (0.146)            
##                                                                                                                                           
## Law in favor of Gender identity self-determination in a country: Exists                                                 0.470             
##                                                                                                                        (0.254)            
##                                                                                                                                           
## Ideology scale of participant                                                                                         -0.187***           
##                                                                                                                        (0.027)            
##                                                                                                                                           
## GDP of countries (%)                                                                                                    0.206             
##                                                                                                                        (0.121)            
##                                                                                                                                           
## Interaction between Muslim religion of participant and percentage of religiosity in the country                         -0.046            
##                                                                                                                        (0.216)            
##                                                                                                                                           
## Interaction between No religion of participant and percentage of religiosity in the country                             0.070             
##                                                                                                                        (0.041)            
##                                                                                                                                           
## Interaction between Orthodox Christian religion of participant and percentage of religiosity in the country            0.206**            
##                                                                                                                        (0.064)            
##                                                                                                                                           
## Interaction between Other religion of participant and percentage of religiosity in the country                        -0.204***           
##                                                                                                                        (0.059)            
##                                                                                                                                           
## Interaction between participant having No contact with LGBTI community and Rainbow Europe Country Indicator             0.046             
##                                                                                                                        (0.036)            
##                                                                                                                                           
## Interaction between ideology of a participant and GDP (%) of a country                                                -0.113***           
##                                                                                                                        (0.030)            
##                                                                                                                                           
## Constant                                                                                                              -0.560***           
##                                                                                                                        (0.129)            
##                                                                                                                                           
## ------------------------------------------------------------------------------------------------------------------------------------------
## Observations                                                                                                            24,057            
## Log Likelihood                                                                                                       -13,074.330          
## Akaike Inf. Crit.                                                                                                     26,218.650          
## Bayesian Inf. Crit.                                                                                                   26,501.740          
## ==========================================================================================================================================
## Note:                                                                                                        *p<0.05; **p<0.01; ***p<0.001
lme4::ranef(final_model) %>% 
  str 
## List of 1
##  $ isocntry:'data.frame':    28 obs. of  4 variables:
##   ..$ (Intercept)    : num [1:28] 0.0232 -0.5231 -0.7716 0.1831 -0.7099 ...
##   ..$ scale(age)     : num [1:28] -0.126 0.544 -0.251 0.131 0.2 ...
##   ..$ female1        : num [1:28] 0.0042 0.0964 -0.1408 -0.0353 -0.0493 ...
##   ..$ scale(ideology): num [1:28] -0.00301 0.112 0.13851 -0.01833 0.19367 ...
##   ..- attr(*, "postVar")= num [1:4, 1:4, 1:28] 0.006147 0.000703 -0.003287 -0.001168 0.000703 ...
##  - attr(*, "class")= chr "ranef.mer"

In final_model, we have refined the model to include both age and its squared term (age^2), maintaining them together to model a curvilinear relationship between age and the target variable.

This final_model is chosen over the step_model, despite having a higher AIC, because the step_model reduces this metric spuriously by eliminating variables that, although not individually significant, contribute theoretical coherence and explanatory value to the model. Over-simplification can falsely improve the AIC without actually leading to a better fit.

Regarding the results, support for trans rights is influenced by several factors. Being a woman significantly increases support, while age has a curvilinear effect, with a decline at older ages. Being retired reduces support, and religiosity also plays a role, particularly in some religious groups where lower acceptance is observed. In contrast, being non-religious is associated with a more favorable stance. Personal satisfaction and contact with LGBTI individuals have strong positive and highly significant effects, suggesting that social interaction strongly shapes attitudes. Additionally, while a more conservative ideology decreases support, this effect is moderated by the country’s GDP per capita.

In terms of random effects, there is variation between countries in the intercept and in the slopes of age, gender, and ideology, indicating that these factors do not influence attitudes in the same way across national contexts. Overall, the model confirms that support for trans rights is a multifaceted phenomenon shaped by both individual and structural factors.

Predictive models

For the predictive part of the model we are going to try the predictive capability of our explanatory model and after that we will try some machine learning techniques as K-nearest neighbours, decision trees and random forest.

We will have two type of training and testing sets. The first one is going to be used in the prediction with the explanatory model and the knn and it will consist in a training set that will contain 70% of the data of each country and in a testing set that will have 30% of the data of each country. The reason behind this splitting is that in the explanatory model we will have random intercept and slopes for each country and in the knn is because it does not admit new levels.

The second set will follow the same 70/30 structure but with a crucial difference—it will leave out one country entirely. This approach allows us to evaluate whether the models can generalize effectively to completely unseen datasets, providing a more robust test of their predictive performance.

set.seed(123)

# Reinsert serialid to be able to pinpoint observations 
data$serialid <- serialid 

# Training and testing set 
training <- data %>%
  group_by(isocntry) %>%
  sample_frac(0.7) %>%
  ungroup()

testing <- data %>%
  anti_join(training, by = c("serialid")) 

# Remove serialid from the model dataset
training <- training %>%
  select(-serialid)

testing <- testing  %>%
  select(-serialid) 

# Training model 
final_model <- glmer(trans_doc ~ 
                       scale(age) + 
                       I(scale(age)^2) +
                       female + 
                       occupation +
                       religion*scale(religiosity_percent) +  
                       personal_satis +
                       contact_lgbti*scale(rain_ind) +
                       self_determination +
                       scale(ideology)*scale(gdp_pc) +
                       (1  + scale(age) +
                          female +
                          scale(ideology)
                        |isocntry), 
                     family = binomial(link = "logit"),
                     control = glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 100000)),
                     data = training)
## boundary (singular) fit: see help('isSingular')
summary(final_model)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: trans_doc ~ scale(age) + I(scale(age)^2) + female + occupation +  
##     religion * scale(religiosity_percent) + personal_satis +  
##     contact_lgbti * scale(rain_ind) + self_determination + scale(ideology) *  
##     scale(gdp_pc) + (1 + scale(age) + female + scale(ideology) |      isocntry)
##    Data: training
## Control: glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 1e+05))
## 
##      AIC      BIC   logLik deviance df.resid 
##  18253.4  18524.0  -9091.7  18183.4    16797 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -7.7042 -0.7378  0.3432  0.6528  4.5987 
## 
## Random effects:
##  Groups   Name            Variance Std.Dev. Corr             
##  isocntry (Intercept)     0.22289  0.4721                    
##           scale(age)      0.03868  0.1967   -0.31            
##           female1         0.04215  0.2053    0.63 -0.01      
##           scale(ideology) 0.01806  0.1344   -0.82  0.26 -0.08
## Number of obs: 16832, groups:  isocntry, 28
## 
## Fixed effects:
##                                                       Estimate Std. Error
## (Intercept)                                           -0.52059    0.12780
## scale(age)                                            -0.02680    0.04845
## I(scale(age)^2)                                       -0.06600    0.02281
## female1                                                0.39057    0.05474
## occupationRetired                                     -0.22608    0.06320
## occupationSelf employed                                0.14972    0.07341
## occupationStudent                                      0.29905    0.10136
## occupationUnemployed                                   0.03343    0.06728
## religionMuslims                                       -0.43853    0.16411
## religionNot religious                                  0.38864    0.05693
## religionOrthodox Chrsitian                            -0.05507    0.08797
## religionOther religions                               -0.09002    0.09448
## scale(religiosity_percent)                            -0.19284    0.09159
## personal_satisSatisfied                                0.35552    0.05257
## contact_lgbtiContact                                   0.89728    0.04253
## scale(rain_ind)                                        0.19637    0.11173
## self_determination1                                    0.44641    0.23026
## scale(ideology)                                       -0.20122    0.03209
## scale(gdp_pc)                                          0.25266    0.10799
## religionMuslims:scale(religiosity_percent)            -0.31707    0.27026
## religionNot religious:scale(religiosity_percent)       0.07324    0.04863
## religionOrthodox Chrsitian:scale(religiosity_percent)  0.21027    0.07579
## religionOther religions:scale(religiosity_percent)    -0.21165    0.07232
## contact_lgbtiContact:scale(rain_ind)                   0.06766    0.04362
## scale(ideology):scale(gdp_pc)                         -0.11946    0.03368
##                                                       z value Pr(>|z|)    
## (Intercept)                                            -4.073 4.63e-05 ***
## scale(age)                                             -0.553 0.580192    
## I(scale(age)^2)                                        -2.894 0.003809 ** 
## female1                                                 7.135 9.68e-13 ***
## occupationRetired                                      -3.577 0.000347 ***
## occupationSelf employed                                 2.039 0.041404 *  
## occupationStudent                                       2.951 0.003173 ** 
## occupationUnemployed                                    0.497 0.619267    
## religionMuslims                                        -2.672 0.007535 ** 
## religionNot religious                                   6.827 8.68e-12 ***
## religionOrthodox Chrsitian                             -0.626 0.531312    
## religionOther religions                                -0.953 0.340682    
## scale(religiosity_percent)                             -2.105 0.035252 *  
## personal_satisSatisfied                                 6.763 1.35e-11 ***
## contact_lgbtiContact                                   21.098  < 2e-16 ***
## scale(rain_ind)                                         1.758 0.078820 .  
## self_determination1                                     1.939 0.052534 .  
## scale(ideology)                                        -6.270 3.61e-10 ***
## scale(gdp_pc)                                           2.340 0.019301 *  
## religionMuslims:scale(religiosity_percent)             -1.173 0.240724    
## religionNot religious:scale(religiosity_percent)        1.506 0.132057    
## religionOrthodox Chrsitian:scale(religiosity_percent)   2.775 0.005528 ** 
## religionOther religions:scale(religiosity_percent)     -2.927 0.003426 ** 
## contact_lgbtiContact:scale(rain_ind)                    1.551 0.120842    
## scale(ideology):scale(gdp_pc)                          -3.547 0.000390 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation matrix not shown by default, as p = 25 > 12.
## Use print(x, correlation=TRUE)  or
##     vcov(x)        if you need it
## optimizer (bobyqa) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
# Complete cases 
training <- training[complete.cases(training), ]
testing <- testing[complete.cases(testing), ]

# Prediction
pred <- predict(final_model, newdata = testing, type = "response")

# Pred() worked: 
prediction <- as.factor(ifelse(pred > 0.5, 1, 0))

# Final: 
confusionMatrix(prediction, testing$trans_doc)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1711  840
##          1 1235 3439
##                                           
##                Accuracy : 0.7128          
##                  95% CI : (0.7022, 0.7232)
##     No Information Rate : 0.5922          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3927          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.5808          
##             Specificity : 0.8037          
##          Pos Pred Value : 0.6707          
##          Neg Pred Value : 0.7358          
##              Prevalence : 0.4078          
##          Detection Rate : 0.2368          
##    Detection Prevalence : 0.3531          
##       Balanced Accuracy : 0.6922          
##                                           
##        'Positive' Class : 0               
## 
confusionMatrix(prediction, testing$trans_doc)$overall[1:2]
##  Accuracy     Kappa 
## 0.7128028 0.3926809

We obtain a pretty good accuracy of more than the 70%.

K-NN

ctrl <- trainControl(method = "repeatedcv", 
                     number = 5,
                     classProbs = T,
                     summaryFunction=twoClassSummary,
                     verboseIter = T)

training <- as.data.frame(training)

training$isocntry <- factor(training$isocntry)
testing$isocntry <- factor(testing$isocntry)

training <- training |> 
  mutate(trans_doc = case_when(
    trans_doc == "1" ~ "Yes",
    trans_doc == "0" ~ "No",
    TRUE ~ NA),
    trans_doc = factor(trans_doc))

testing <- testing |> 
  mutate(trans_doc = case_when(
    trans_doc == "1" ~ "Yes",
    trans_doc == "0" ~ "No",
    TRUE ~ NA),
    trans_doc = factor(trans_doc))

str(training)
## 'data.frame':    16832 obs. of  16 variables:
##  $ isocntry           : Factor w/ 28 levels "AT","BE","BG",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ trans_doc          : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 1 1 1 1 ...
##  $ female             : Factor w/ 2 levels "0","1": 2 2 1 1 2 1 2 1 2 1 ...
##  $ age                : num  53 34 48 31 38 45 31 68 58 55 ...
##  $ religion           : Factor w/ 5 levels "Christians","Muslims",..: 1 1 1 3 1 1 3 1 1 1 ...
##  $ marital_status     : Factor w/ 5 levels "Divorced","Married",..: 2 4 2 4 2 4 2 2 2 2 ...
##  $ personal_satis     : Factor w/ 2 levels "Not satisfied",..: 2 1 2 1 2 2 1 2 2 2 ...
##  $ ideology           : num  7 3 6 9 2 5 8 7 7 5 ...
##  $ contact_lgbti      : Factor w/ 2 levels "No contact","Contact": 2 1 2 1 1 2 1 1 1 1 ...
##  $ social_class       : Factor w/ 3 levels "High class","Middle class",..: 2 2 2 3 2 2 2 2 2 3 ...
##  $ occupation         : Factor w/ 5 levels "Employed","Retired",..: 5 1 3 5 1 1 1 2 3 1 ...
##  $ rain_ind           : num  49.6 49.6 49.6 49.6 49.6 ...
##  $ self_determination : chr  "0" "0" "0" "0" ...
##  $ gini               : num  30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 ...
##  $ gdp_pc             : num  10.7 10.7 10.7 10.7 10.7 ...
##  $ religiosity_percent: num  0.854 0.854 0.854 0.854 0.854 ...
knnFit <- train(trans_doc ~ ., 
                data = training,
                method = "kknn",   
                preProc=c('scale','center'),
                tuneLength = 10,
                metric="ROC",
                trControl = ctrl)
## + Fold1.Rep1: kmax= 5, distance=2, kernel=optimal 
## - Fold1.Rep1: kmax= 5, distance=2, kernel=optimal 
## + Fold1.Rep1: kmax= 7, distance=2, kernel=optimal 
## - Fold1.Rep1: kmax= 7, distance=2, kernel=optimal 
## + Fold1.Rep1: kmax= 9, distance=2, kernel=optimal 
## - Fold1.Rep1: kmax= 9, distance=2, kernel=optimal 
## + Fold1.Rep1: kmax=11, distance=2, kernel=optimal 
## - Fold1.Rep1: kmax=11, distance=2, kernel=optimal 
## + Fold1.Rep1: kmax=13, distance=2, kernel=optimal 
## - Fold1.Rep1: kmax=13, distance=2, kernel=optimal 
## + Fold1.Rep1: kmax=15, distance=2, kernel=optimal 
## - Fold1.Rep1: kmax=15, distance=2, kernel=optimal 
## + Fold1.Rep1: kmax=17, distance=2, kernel=optimal 
## - Fold1.Rep1: kmax=17, distance=2, kernel=optimal 
## + Fold1.Rep1: kmax=19, distance=2, kernel=optimal 
## - Fold1.Rep1: kmax=19, distance=2, kernel=optimal 
## + Fold1.Rep1: kmax=21, distance=2, kernel=optimal 
## - Fold1.Rep1: kmax=21, distance=2, kernel=optimal 
## + Fold1.Rep1: kmax=23, distance=2, kernel=optimal 
## - Fold1.Rep1: kmax=23, distance=2, kernel=optimal 
## + Fold2.Rep1: kmax= 5, distance=2, kernel=optimal 
## - Fold2.Rep1: kmax= 5, distance=2, kernel=optimal 
## + Fold2.Rep1: kmax= 7, distance=2, kernel=optimal 
## - Fold2.Rep1: kmax= 7, distance=2, kernel=optimal 
## + Fold2.Rep1: kmax= 9, distance=2, kernel=optimal 
## - Fold2.Rep1: kmax= 9, distance=2, kernel=optimal 
## + Fold2.Rep1: kmax=11, distance=2, kernel=optimal 
## - Fold2.Rep1: kmax=11, distance=2, kernel=optimal 
## + Fold2.Rep1: kmax=13, distance=2, kernel=optimal 
## - Fold2.Rep1: kmax=13, distance=2, kernel=optimal 
## + Fold2.Rep1: kmax=15, distance=2, kernel=optimal 
## - Fold2.Rep1: kmax=15, distance=2, kernel=optimal 
## + Fold2.Rep1: kmax=17, distance=2, kernel=optimal 
## - Fold2.Rep1: kmax=17, distance=2, kernel=optimal 
## + Fold2.Rep1: kmax=19, distance=2, kernel=optimal 
## - Fold2.Rep1: kmax=19, distance=2, kernel=optimal 
## + Fold2.Rep1: kmax=21, distance=2, kernel=optimal 
## - Fold2.Rep1: kmax=21, distance=2, kernel=optimal 
## + Fold2.Rep1: kmax=23, distance=2, kernel=optimal 
## - Fold2.Rep1: kmax=23, distance=2, kernel=optimal 
## + Fold3.Rep1: kmax= 5, distance=2, kernel=optimal 
## - Fold3.Rep1: kmax= 5, distance=2, kernel=optimal 
## + Fold3.Rep1: kmax= 7, distance=2, kernel=optimal 
## - Fold3.Rep1: kmax= 7, distance=2, kernel=optimal 
## + Fold3.Rep1: kmax= 9, distance=2, kernel=optimal 
## - Fold3.Rep1: kmax= 9, distance=2, kernel=optimal 
## + Fold3.Rep1: kmax=11, distance=2, kernel=optimal 
## - Fold3.Rep1: kmax=11, distance=2, kernel=optimal 
## + Fold3.Rep1: kmax=13, distance=2, kernel=optimal 
## - Fold3.Rep1: kmax=13, distance=2, kernel=optimal 
## + Fold3.Rep1: kmax=15, distance=2, kernel=optimal 
## - Fold3.Rep1: kmax=15, distance=2, kernel=optimal 
## + Fold3.Rep1: kmax=17, distance=2, kernel=optimal 
## - Fold3.Rep1: kmax=17, distance=2, kernel=optimal 
## + Fold3.Rep1: kmax=19, distance=2, kernel=optimal 
## - Fold3.Rep1: kmax=19, distance=2, kernel=optimal 
## + Fold3.Rep1: kmax=21, distance=2, kernel=optimal 
## - Fold3.Rep1: kmax=21, distance=2, kernel=optimal 
## + Fold3.Rep1: kmax=23, distance=2, kernel=optimal 
## - Fold3.Rep1: kmax=23, distance=2, kernel=optimal 
## + Fold4.Rep1: kmax= 5, distance=2, kernel=optimal 
## - Fold4.Rep1: kmax= 5, distance=2, kernel=optimal 
## + Fold4.Rep1: kmax= 7, distance=2, kernel=optimal 
## - Fold4.Rep1: kmax= 7, distance=2, kernel=optimal 
## + Fold4.Rep1: kmax= 9, distance=2, kernel=optimal 
## - Fold4.Rep1: kmax= 9, distance=2, kernel=optimal 
## + Fold4.Rep1: kmax=11, distance=2, kernel=optimal 
## - Fold4.Rep1: kmax=11, distance=2, kernel=optimal 
## + Fold4.Rep1: kmax=13, distance=2, kernel=optimal 
## - Fold4.Rep1: kmax=13, distance=2, kernel=optimal 
## + Fold4.Rep1: kmax=15, distance=2, kernel=optimal 
## - Fold4.Rep1: kmax=15, distance=2, kernel=optimal 
## + Fold4.Rep1: kmax=17, distance=2, kernel=optimal 
## - Fold4.Rep1: kmax=17, distance=2, kernel=optimal 
## + Fold4.Rep1: kmax=19, distance=2, kernel=optimal 
## - Fold4.Rep1: kmax=19, distance=2, kernel=optimal 
## + Fold4.Rep1: kmax=21, distance=2, kernel=optimal 
## - Fold4.Rep1: kmax=21, distance=2, kernel=optimal 
## + Fold4.Rep1: kmax=23, distance=2, kernel=optimal 
## - Fold4.Rep1: kmax=23, distance=2, kernel=optimal 
## + Fold5.Rep1: kmax= 5, distance=2, kernel=optimal 
## - Fold5.Rep1: kmax= 5, distance=2, kernel=optimal 
## + Fold5.Rep1: kmax= 7, distance=2, kernel=optimal 
## - Fold5.Rep1: kmax= 7, distance=2, kernel=optimal 
## + Fold5.Rep1: kmax= 9, distance=2, kernel=optimal 
## - Fold5.Rep1: kmax= 9, distance=2, kernel=optimal 
## + Fold5.Rep1: kmax=11, distance=2, kernel=optimal 
## - Fold5.Rep1: kmax=11, distance=2, kernel=optimal 
## + Fold5.Rep1: kmax=13, distance=2, kernel=optimal 
## - Fold5.Rep1: kmax=13, distance=2, kernel=optimal 
## + Fold5.Rep1: kmax=15, distance=2, kernel=optimal 
## - Fold5.Rep1: kmax=15, distance=2, kernel=optimal 
## + Fold5.Rep1: kmax=17, distance=2, kernel=optimal 
## - Fold5.Rep1: kmax=17, distance=2, kernel=optimal 
## + Fold5.Rep1: kmax=19, distance=2, kernel=optimal 
## - Fold5.Rep1: kmax=19, distance=2, kernel=optimal 
## + Fold5.Rep1: kmax=21, distance=2, kernel=optimal 
## - Fold5.Rep1: kmax=21, distance=2, kernel=optimal 
## + Fold5.Rep1: kmax=23, distance=2, kernel=optimal 
## - Fold5.Rep1: kmax=23, distance=2, kernel=optimal 
## Aggregating results
## Selecting tuning parameters
## Fitting kmax = 23, distance = 2, kernel = optimal on full training set
knnProb <- predict(knnFit, testing, type="prob", )
prediction <- as.factor(ifelse(knnProb[,2] > 0.5, "Yes", "No"))


confusionMatrix(prediction, testing$trans_doc)$table
##           Reference
## Prediction   No  Yes
##        No  1653  868
##        Yes 1293 3411
confusionMatrix(prediction, testing$trans_doc)$overall[1:2]
##  Accuracy     Kappa 
## 0.7008997 0.3664838

This model has a little worst accuracy than the our explanatory model.

Decision trees

From now on, we will do not have Finland in the training set.

# Separate Finland as our test country 
data$serialid <- serialid 

no_fi <- data |> 
  filter(isocntry != "FI") |> 
  mutate(trans_doc = case_when(
    trans_doc == "1" ~ "Yes",
    trans_doc == "0" ~ "No",
    TRUE ~ NA
  ))

test_country_fi <- data |> 
  filter(isocntry == "FI")|> 
  mutate(trans_doc = case_when(
    trans_doc == "1" ~ "Yes",
    trans_doc == "0" ~ "No",
    TRUE ~ NA
  ))

# Training and testing set 
training2 <- no_fi %>%
  group_by(isocntry) %>%
  sample_frac(0.7) %>%
  ungroup()

test_no_fi <- no_fi %>%
  anti_join(training2, by = c("serialid"))  

testing2 <- test_no_fi |> 
  bind_rows(test_country_fi)

# Remove serialid from the model dataset
training2 <- training2 %>%
  select(-c(serialid, isocntry))

testing2 <- testing2  %>%
  select(-c(serialid, isocntry)) 

test_country_fi <- test_country_fi |> 
  select(-c(serialid, isocntry)) 
training2 <- training2[complete.cases(training2), ]
testing2 <- testing2[complete.cases(testing2), ]
test_no_fi <- test_no_fi[complete.cases(test_no_fi), ]
test_country_fi <- test_country_fi[complete.cases(test_country_fi), ]
training2$trans_doc <- as.factor(training2$trans_doc)

training2 <- as.data.frame(training2)

control <- rpart.control(minsplit = 30, maxdepth = 10, cp=0.01)

model <- trans_doc ~ .
dtFit <- rpart(model, data=training2, method = "class", control = control)
summary(dtFit)
## Call:
## rpart(formula = model, data = training2, method = "class", control = control)
##   n= 16203 
## 
##           CP nsplit rel error    xerror        xstd
## 1 0.18862138      0 1.0000000 1.0000000 0.009572552
## 2 0.04784542      1 0.8113786 0.8113786 0.009153963
## 3 0.01000000      3 0.7156878 0.7156878 0.008839641
## 
## Variable importance
##            rain_ind              gdp_pc religiosity_percent       contact_lgbti 
##                  27                  25                  14                  13 
##  self_determination                gini            religion 
##                  12                   7                   1 
## 
## Node number 1: 16203 observations,    complexity param=0.1886214
##   predicted class=Yes  expected loss=0.4024563  P(node) =1
##     class counts:  6521  9682
##    probabilities: 0.402 0.598 
##   left son=2 (8164 obs) right son=3 (8039 obs)
##   Primary splits:
##       rain_ind            < 50.975   to the left,  improve=983.5302, (0 missing)
##       contact_lgbti       splits as  LR,           improve=873.5439, (0 missing)
##       gdp_pc              < 9.93729  to the left,  improve=872.3062, (0 missing)
##       self_determination  splits as  LR,           improve=401.0324, (0 missing)
##       religiosity_percent < 0.79145  to the right, improve=380.2319, (0 missing)
##   Surrogate splits:
##       gdp_pc              < 9.93729  to the left,  agree=0.864, adj=0.726, (0 split)
##       self_determination  splits as  LR,           agree=0.744, adj=0.485, (0 split)
##       religiosity_percent < 0.79145  to the right, agree=0.713, adj=0.422, (0 split)
##       contact_lgbti       splits as  LR,           agree=0.683, adj=0.361, (0 split)
##       gini                < 34.7     to the right, agree=0.602, adj=0.198, (0 split)
## 
## Node number 2: 8164 observations,    complexity param=0.04784542
##   predicted class=No   expected loss=0.4246693  P(node) =0.5038573
##     class counts:  4697  3467
##    probabilities: 0.575 0.425 
##   left son=4 (1771 obs) right son=5 (6393 obs)
##   Primary splits:
##       gdp_pc              < 9.710632 to the left,  improve=256.93210, (0 missing)
##       contact_lgbti       splits as  LR,           improve=226.72940, (0 missing)
##       rain_ind            < 33.575   to the left,  improve=133.04940, (0 missing)
##       religiosity_percent < 0.9509   to the right, improve= 91.92237, (0 missing)
##       gini                < 37.85    to the right, improve= 82.53612, (0 missing)
##   Surrogate splits:
##       religiosity_percent < 0.9509   to the right, agree=0.888, adj=0.486, (0 split)
##       rain_ind            < 23.595   to the left,  agree=0.855, adj=0.332, (0 split)
##       gini                < 37.85    to the right, agree=0.846, adj=0.290, (0 split)
##       religion            splits as  RRRLR,        agree=0.827, adj=0.202, (0 split)
## 
## Node number 3: 8039 observations
##   predicted class=Yes  expected loss=0.2268939  P(node) =0.4961427
##     class counts:  1824  6215
##    probabilities: 0.227 0.773 
## 
## Node number 4: 1771 observations
##   predicted class=No   expected loss=0.1863354  P(node) =0.1093007
##     class counts:  1441   330
##    probabilities: 0.814 0.186 
## 
## Node number 5: 6393 observations,    complexity param=0.04784542
##   predicted class=No   expected loss=0.4906929  P(node) =0.3945566
##     class counts:  3256  3137
##    probabilities: 0.509 0.491 
##   left son=10 (4539 obs) right son=11 (1854 obs)
##   Primary splits:
##       contact_lgbti splits as  LR,           improve=164.71390, (0 missing)
##       gini          < 24.2     to the left,  improve= 47.21759, (0 missing)
##       rain_ind      < 32.625   to the left,  improve= 34.62362, (0 missing)
##       occupation    splits as  RLRRR,        improve= 33.37894, (0 missing)
##       gdp_pc        < 9.887161 to the left,  improve= 33.29937, (0 missing)
##   Surrogate splits:
##       occupation splits as  LLLRL, agree=0.714, adj=0.014, (0 split)
## 
## Node number 10: 4539 observations
##   predicted class=No   expected loss=0.4181538  P(node) =0.2801333
##     class counts:  2641  1898
##    probabilities: 0.582 0.418 
## 
## Node number 11: 1854 observations
##   predicted class=Yes  expected loss=0.3317152  P(node) =0.1144233
##     class counts:   615  1239
##    probabilities: 0.332 0.668
rpart.plot(dtFit, digits=3)

dtPred <- predict(dtFit, testing2, type = "class")

dtProb <- predict(dtFit, testing2, type = "prob")

prediction <- as.factor(ifelse(dtProb[,2] > 0.5, "Yes", "No"))

testing2$trans_doc <- factor(testing2$trans_doc)
test_country_fi$trans_doc <- factor(test_country_fi$trans_doc)

confusionMatrix(prediction, testing2$trans_doc)$table
##           Reference
## Prediction   No  Yes
##        No  1798  943
##        Yes 1313 3800
confusionMatrix(prediction, testing2$trans_doc)$overall[1:2]
##  Accuracy     Kappa 
## 0.7127578 0.3870495
caret.fit <- train(model, 
                   data = training2, 
                   method = "rpart",
                   control=rpart.control(minsplit = 8, maxdepth = 12),
                   trControl = ctrl,
                   tuneLength=10)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
## + Fold1.Rep1: cp=0.0006901 
## - Fold1.Rep1: cp=0.0006901 
## + Fold2.Rep1: cp=0.0006901 
## - Fold2.Rep1: cp=0.0006901 
## + Fold3.Rep1: cp=0.0006901 
## - Fold3.Rep1: cp=0.0006901 
## + Fold4.Rep1: cp=0.0006901 
## - Fold4.Rep1: cp=0.0006901 
## + Fold5.Rep1: cp=0.0006901 
## - Fold5.Rep1: cp=0.0006901 
## Aggregating results
## Selecting tuning parameters
## Fitting cp = 0.00069 on full training set
caret.fit
## CART 
## 
## 16203 samples
##    14 predictor
##     2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 12963, 12962, 12963, 12962, 12962 
## Resampling results across tuning parameters:
## 
##   cp            ROC        Sens       Spec     
##   0.0006900782  0.7630739  0.6048188  0.7926055
##   0.0007667536  0.7618356  0.5925504  0.8012800
##   0.0008178705  0.7617616  0.5923971  0.8021064
##   0.0009201043  0.7608190  0.5922416  0.8006618
##   0.0013801564  0.7592361  0.5899438  0.8003511
##   0.0019424424  0.7593294  0.6025199  0.7937430
##   0.0024536114  0.7572079  0.6249126  0.7777333
##   0.0035781833  0.7434340  0.6399357  0.7636847
##   0.0478454225  0.6981598  0.6801229  0.6929325
##   0.1886213771  0.5698110  0.2839402  0.8556818
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.0006900782.
rpart.plot(caret.fit$finalModel)
## Warning: labs do not fit even at cex 0.15, there may be some overplotting

dtProb <- predict(caret.fit, testing2, type = "prob")
dtProb_fi <- predict(caret.fit, test_country_fi, type = "prob")

prediction <- as.factor(ifelse(dtProb[,2] > 0.5, "Yes", "No"))
prediction_fi <- as.factor(ifelse(dtProb_fi[,2] > 0.5, "Yes", "No"))

confusionMatrix(prediction, testing2$trans_doc)$table
##           Reference
## Prediction   No  Yes
##        No  1794  911
##        Yes 1317 3832
confusionMatrix(prediction, testing2$trans_doc)$overall[1:2]
##  Accuracy     Kappa 
## 0.7163229 0.3934242
confusionMatrix(prediction_fi, test_country_fi$trans_doc)$table
##           Reference
## Prediction  No Yes
##        No    0   3
##        Yes 256 624
confusionMatrix(prediction_fi, test_country_fi$trans_doc)$overall[1:2]
##     Accuracy        Kappa 
##  0.706681767 -0.006761724

By using decision tree, besides getting a better prediction than in the previous models, we can also observe how the variables are used to classified the observations.

Random Forest

rfFit <- train(trans_doc ~ ., 
               data = training2,
               method = "rf",   
               preProc=c('scale','center'),
               tuneLength = 10,
               metric="ROC",
               trControl = ctrl)
## + Fold1.Rep1: mtry= 2 
## - Fold1.Rep1: mtry= 2 
## + Fold1.Rep1: mtry= 4 
## - Fold1.Rep1: mtry= 4 
## + Fold1.Rep1: mtry= 6 
## - Fold1.Rep1: mtry= 6 
## + Fold1.Rep1: mtry= 9 
## - Fold1.Rep1: mtry= 9 
## + Fold1.Rep1: mtry=11 
## - Fold1.Rep1: mtry=11 
## + Fold1.Rep1: mtry=14 
## - Fold1.Rep1: mtry=14 
## + Fold1.Rep1: mtry=16 
## - Fold1.Rep1: mtry=16 
## + Fold1.Rep1: mtry=19 
## - Fold1.Rep1: mtry=19 
## + Fold1.Rep1: mtry=21 
## - Fold1.Rep1: mtry=21 
## + Fold1.Rep1: mtry=24 
## - Fold1.Rep1: mtry=24 
## + Fold2.Rep1: mtry= 2 
## - Fold2.Rep1: mtry= 2 
## + Fold2.Rep1: mtry= 4 
## - Fold2.Rep1: mtry= 4 
## + Fold2.Rep1: mtry= 6 
## - Fold2.Rep1: mtry= 6 
## + Fold2.Rep1: mtry= 9 
## - Fold2.Rep1: mtry= 9 
## + Fold2.Rep1: mtry=11 
## - Fold2.Rep1: mtry=11 
## + Fold2.Rep1: mtry=14 
## - Fold2.Rep1: mtry=14 
## + Fold2.Rep1: mtry=16 
## - Fold2.Rep1: mtry=16 
## + Fold2.Rep1: mtry=19 
## - Fold2.Rep1: mtry=19 
## + Fold2.Rep1: mtry=21 
## - Fold2.Rep1: mtry=21 
## + Fold2.Rep1: mtry=24 
## - Fold2.Rep1: mtry=24 
## + Fold3.Rep1: mtry= 2 
## - Fold3.Rep1: mtry= 2 
## + Fold3.Rep1: mtry= 4 
## - Fold3.Rep1: mtry= 4 
## + Fold3.Rep1: mtry= 6 
## - Fold3.Rep1: mtry= 6 
## + Fold3.Rep1: mtry= 9 
## - Fold3.Rep1: mtry= 9 
## + Fold3.Rep1: mtry=11 
## - Fold3.Rep1: mtry=11 
## + Fold3.Rep1: mtry=14 
## - Fold3.Rep1: mtry=14 
## + Fold3.Rep1: mtry=16 
## - Fold3.Rep1: mtry=16 
## + Fold3.Rep1: mtry=19 
## - Fold3.Rep1: mtry=19 
## + Fold3.Rep1: mtry=21 
## - Fold3.Rep1: mtry=21 
## + Fold3.Rep1: mtry=24 
## - Fold3.Rep1: mtry=24 
## + Fold4.Rep1: mtry= 2 
## - Fold4.Rep1: mtry= 2 
## + Fold4.Rep1: mtry= 4 
## - Fold4.Rep1: mtry= 4 
## + Fold4.Rep1: mtry= 6 
## - Fold4.Rep1: mtry= 6 
## + Fold4.Rep1: mtry= 9 
## - Fold4.Rep1: mtry= 9 
## + Fold4.Rep1: mtry=11 
## - Fold4.Rep1: mtry=11 
## + Fold4.Rep1: mtry=14 
## - Fold4.Rep1: mtry=14 
## + Fold4.Rep1: mtry=16 
## - Fold4.Rep1: mtry=16 
## + Fold4.Rep1: mtry=19 
## - Fold4.Rep1: mtry=19 
## + Fold4.Rep1: mtry=21 
## - Fold4.Rep1: mtry=21 
## + Fold4.Rep1: mtry=24 
## - Fold4.Rep1: mtry=24 
## + Fold5.Rep1: mtry= 2 
## - Fold5.Rep1: mtry= 2 
## + Fold5.Rep1: mtry= 4 
## - Fold5.Rep1: mtry= 4 
## + Fold5.Rep1: mtry= 6 
## - Fold5.Rep1: mtry= 6 
## + Fold5.Rep1: mtry= 9 
## - Fold5.Rep1: mtry= 9 
## + Fold5.Rep1: mtry=11 
## - Fold5.Rep1: mtry=11 
## + Fold5.Rep1: mtry=14 
## - Fold5.Rep1: mtry=14 
## + Fold5.Rep1: mtry=16 
## - Fold5.Rep1: mtry=16 
## + Fold5.Rep1: mtry=19 
## - Fold5.Rep1: mtry=19 
## + Fold5.Rep1: mtry=21 
## - Fold5.Rep1: mtry=21 
## + Fold5.Rep1: mtry=24 
## - Fold5.Rep1: mtry=24 
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 2 on full training set
plot(rfFit)

rfProb <- predict(rfFit, testing2, type="prob")
rfProb_fi <- predict(rfFit, test_country_fi, type = "prob")

prediction <- as.factor(ifelse(rfProb[,2] > 0.5, "Yes", "No"))
prediction_fi <- as.factor(ifelse(rfProb_fi[,2] > 0.2, "Yes", "No"))

confusionMatrix(prediction, testing2$trans_doc)$table
##           Reference
## Prediction   No  Yes
##        No  1767  872
##        Yes 1344 3871
confusionMatrix(prediction, testing2$trans_doc)$overall[1:2]
##  Accuracy     Kappa 
## 0.7178508 0.3944302
confusionMatrix(prediction_fi, test_country_fi$trans_doc)$table
## Warning in confusionMatrix.default(prediction_fi, test_country_fi$trans_doc):
## Levels are not in the same order for reference and data. Refactoring data to
## match.
##           Reference
## Prediction  No Yes
##        No    0   0
##        Yes 256 627
confusionMatrix(prediction_fi, test_country_fi$trans_doc)$overall[1:2]
## Warning in confusionMatrix.default(prediction_fi, test_country_fi$trans_doc):
## Levels are not in the same order for reference and data. Refactoring data to
## match.
##  Accuracy     Kappa 
## 0.7100793 0.0000000

The Random Forest model delivers the best results in terms of accuracy and balance between sensitivity and specificity. However, in the specific case of Finland, it performs poorly in predicting opposition to the gender-change measure.

Paradata

Finally, for an extra analysis we will analyze the paradata from the survey.

# read data
survey <- read_dta("data/ZA7575.dta")

# Load variables 
paradata <- survey %>% 
  select(c(country, isocntry, qc19, d60, p3, p4, p5))
#p3 -> minutes the interview lasted
#p4 -> Number of persons present during the interview
#p5 -> respondent cooperation
# Transform data in order to work correctly with the variables
paradata <- paradata %>% 
  mutate(
    trans_name_na = if_else(qc19 == 3, 1, 0), 
    difficulties_bills = as.factor(if_else(d60 == 4, NA, d60)),
    int_minutes = p3, 
    int_persons = as.factor(p4), 
    int_cooperation = as.factor(p5) 
  ) 

# Country names: DE-E and DE-W were problems 
paradata <- paradata |>   
  mutate(country = country_name(isocntry, to = "simple", 
                                verbose = TRUE, poor_matches = TRUE),
         isocntry = countrycode::countrycode(
           country, origin = "country.name", destination = "iso2c"))
## 
## In total 29 unique country names were provided
## 27/29 have been matched with EXACT matching
## 2/29 have been matched with FUZZY matching, out of which:
## 2/2 are a POOR match (likely wrongly identified)
## 
## 
## Multiple arguments have been matched to the same country name:
##   - DE-E : Germany 
##   - DE-W : Germany
## 
## The matching for the following countries is likely inaccurate:
## (Set - poor_matches - to FALSE to return NAs and - na_fill - to replace the NAs with the original name in - x)
##  - DE-E : Germany 
##  - DE-W : Germany
# NAs by country: Here we are creating a table with the proportion of NA's in the target variable in every country
na_country <- paradata %>% 
  group_by(country, isocntry) %>% 
  summarise(
    na_target = sum(trans_name_na), 
    total_values = n()
  ) %>% 
  mutate(
    na_proportion = (na_target / total_values)
  ) 
## `summarise()` has grouped output by 'country'. You can override using the
## `.groups` argument.
# Here we are creating a table with the proportion of response NO in the target variable in every country
na_country_plot <- na_country %>%
  left_join(paradata %>% filter(qc19 == 2) %>%
              group_by(country) %>%
              summarise(count_qc19_no = n()) %>%
              ungroup(),
            by = "country") %>%
  mutate(
    proportion_qc19_no = count_qc19_no / total_values 
  )
# Tables, plots --------------------------------------------

p <- ggplot(na_country, 
            aes(x = reorder(country, -na_proportion), 
                y = na_proportion, 
                fill = na_proportion)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = percent(na_proportion, accuracy = 1)), 
            position = position_stack(vjust = 0.5), 
            color = "white", 
            size = 2) +
  labs(x = "", y = "Proportion of NA", 
       title = "Proportion of NA in Qc_19 by Country", 
       subtitle = "Ordered by Proportion",
       x = "country",
       y = "% of missing values") +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
        axis.title = element_text(size = 12),
        plot.title = element_text(size = 16, face = "bold"),
        plot.subtitle = element_text(size = 12),
        legend.position = "none") +
  guides(fill = guide_legend(reverse = TRUE))

ggplotly(p, tooltip = "text")

This chart illustrates the proportion of missing values (NA) in the target variable across different countries. We notice that Eastern European countries (like Bulgaria, Poland, and Slovakia) tend to have higher NA rates, while Western European countries (like Belgium, Austria, and the Netherlands) have lower ones.

Notably, it appears to be a potential relationship between the proportion of missing data and levels of support for the trans movement, as countries with higher NA rates are often those with lower reported support. To assess this possible correlation, we will conduct further analysis using a scatterplot.

# Relationship between NA's and No responses in Qc19 by country
p <- ggplot(na_country_plot, aes(x = na_proportion, y = proportion_qc19_no, text = country)) +
  geom_point(aes(color = proportion_qc19_no), size = 3, shape = 21, stroke = 1, fill = "white") +  
  geom_smooth(method = "lm", se = FALSE, color = "red3", linewidth = 1.5, linetype = "solid") +  
  scale_color_gradientn(colors = "darkblue") +  
  labs(
    title = "Relationship between NA's and No responses in Qc19 by country",
    x = "NA Proportion",
    y = "No Responses Proportion"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    legend.position = "none",
    panel.grid.major = element_line(color = "grey80", linetype = "dashed"),
    panel.grid.minor = element_blank()
  )


lm_model <- lm(proportion_qc19_no ~ na_proportion, data = na_country_plot)

regression_line <- data.frame(
  na_proportion = seq(min(na_country_plot$na_proportion), max(na_country_plot$na_proportion), length.out = 100),
  proportion_qc19_no = predict(lm_model, newdata = data.frame(na_proportion = seq(min(na_country_plot$na_proportion), max(na_country_plot$na_proportion), length.out = 100)))
)

ggplotly(p, tooltip = "text") %>%
  add_lines(data = regression_line, x = ~na_proportion, y = ~proportion_qc19_no, 
            line = list(color = 'orange', width = 1), name = 'Regression line')
## `geom_smooth()` using formula = 'y ~ x'
## Warning: 'scatter' objects don't have these attributes: 'colour'
## Valid attributes include:
## 'cliponaxis', 'connectgaps', 'customdata', 'customdatasrc', 'dx', 'dy', 'error_x', 'error_y', 'fill', 'fillcolor', 'fillpattern', 'groupnorm', 'hoverinfo', 'hoverinfosrc', 'hoverlabel', 'hoveron', 'hovertemplate', 'hovertemplatesrc', 'hovertext', 'hovertextsrc', 'ids', 'idssrc', 'legendgroup', 'legendgrouptitle', 'legendrank', 'line', 'marker', 'meta', 'metasrc', 'mode', 'name', 'opacity', 'orientation', 'selected', 'selectedpoints', 'showlegend', 'stackgaps', 'stackgroup', 'stream', 'text', 'textfont', 'textposition', 'textpositionsrc', 'textsrc', 'texttemplate', 'texttemplatesrc', 'transforms', 'type', 'uid', 'uirevision', 'unselected', 'visible', 'x', 'x0', 'xaxis', 'xcalendar', 'xhoverformat', 'xperiod', 'xperiod0', 'xperiodalignment', 'xsrc', 'y', 'y0', 'yaxis', 'ycalendar', 'yhoverformat', 'yperiod', 'yperiod0', 'yperiodalignment', 'ysrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule', '_bbox'

The scatterplot suggests a positive correlation between NA proportion and “No” responses to the question on transgender rights. This could indicate that in countries with more missing responses, there is also greater reluctance to support legal gender changes. Possible explanations include social desirability bias, where respondents prefer to skip rather than explicitly oppose, lack of awareness or engagement with the issue, or cultural and political factors that make the topic more sensitive.

# Model ---------------------------------------------------
model <- glm(trans_name_na ~ 
               difficulties_bills + 
               int_minutes + 
               int_persons + 
               int_cooperation, 
             data = paradata, 
             family = "binomial")
summary(model)
## 
## Call:
## glm(formula = trans_name_na ~ difficulties_bills + int_minutes + 
##     int_persons + int_cooperation, family = "binomial", data = paradata)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -2.3515734  0.0817324 -28.772  < 2e-16 ***
## difficulties_bills2  0.0972786  0.0798628   1.218  0.22320    
## difficulties_bills3  0.1650396  0.0743351   2.220  0.02640 *  
## difficulties_bills7  1.0504731  0.1369227   7.672 1.69e-14 ***
## int_minutes         -0.0007093  0.0008177  -0.867  0.38575    
## int_persons2        -0.0353521  0.0559956  -0.631  0.52782    
## int_persons3        -0.4877461  0.1761923  -2.768  0.00564 ** 
## int_persons4        -0.3593376  0.3341440  -1.075  0.28220    
## int_cooperation2     0.4251679  0.0416496  10.208  < 2e-16 ***
## int_cooperation3     0.7156940  0.0611792  11.698  < 2e-16 ***
## int_cooperation4     1.4561652  0.1235217  11.789  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20085  on 27437  degrees of freedom
## Residual deviance: 19745  on 27427  degrees of freedom
## AIC: 19767
## 
## Number of Fisher Scoring iterations: 5

This model predicts the probability of having a NA in trans_name_na based on factors such as payment difficulties, interview duration, number of people present, and respondent cooperation. For payment difficulties, severe difficulties (difficulties_bills7) significantly increase the probability of not responding to the trans_doc question.

However, other categories of difficulties (difficulties_bills2 and difficulties_bills3) do not show strong effects, although difficulties_bills3 suggests a slight increase in probability.The interview time has no significant effect on the probability of trans_doc being NA, as its p-value is high. Similarly, the number of people present does not show a clear effect, though having three people present (int_persons3) decreases the probability of trans_name_na being NA.

Finally, respondent cooperation has a positive and significant effect on the probability of not responding to the qc_19 question. The more cooperative the respondent, the higher the probability that trans_name_na will be 1, with a particularly notable increase for higher levels of cooperation. In summary, the main variables significantly affecting the probability of trans_name_na being 1 are payment difficulties and respondent cooperation, while interview time and number of people present have no relevant effect.

Conclusions

data_comparaison <- data |> 
  filter(isocntry=="IT"| isocntry == "ES")

data_comparaison <- data_comparaison |> 
  mutate(female = case_when(
    female == "1" ~ 1,
    female == "0" ~ 0,
    TRUE ~ NA
  ))

ggplot(data_comparaison, aes(x = isocntry, fill = trans_doc)) + 
  geom_bar(position = "dodge") + 
  labs(title = "Difference in target variable between Spain and Italy",
       x = "Country",
       y = "Frequency") +  
  scale_fill_manual(values = c("0" = "red3", "1" = "olivedrab3"),
                    labels = c("No", "Yes")) +  
  theme_minimal() +
  theme(legend.title = element_blank())

In the previous graph we can observe that Spain support for transgender people to change their official documents to match their inner gender identity is much greater that the support given by Italian people. Now we will follow the results obtained in the explanatory to see if they help us understand the differences between these two countries.

summary_data <- data_comparaison %>%
  group_by(isocntry) %>%
  summarise(
    mean_age = mean(age, na.rm = TRUE),
    perc_female = mean(as.numeric(female), na.rm = TRUE),  
    mean_rain_ind = mean(rain_ind, na.rm = TRUE),
    mean_gdp_pc = mean(gdp_pc, na.rm = TRUE),
    mean_ideology = mean(ideology, na.rm = TRUE),
    gdp = mean(gdp_pc, na.rm = TRUE),
    rain_ind = mean(rain_ind, na.rm = TRUE),
    religiosity_percent = mean(religiosity_percent, na.rm = TRUE)
  )

summary_data
## # A tibble: 2 × 9
##   isocntry mean_age perc_female mean_rain_ind mean_gdp_pc mean_ideology   gdp
##   <chr>       <dbl>       <dbl>         <dbl>       <dbl>         <dbl> <dbl>
## 1 ES           50.8       0.511          76.4        10.3          4.37  10.3
## 2 IT           48.8       0.507          25.4        10.4          5.65  10.4
## # ℹ 2 more variables: rain_ind <dbl>, religiosity_percent <dbl>
summary_numeric_long <- summary_data %>%
  pivot_longer(cols = -isocntry, names_to = "variable", values_to = "value")

ggplot(summary_numeric_long, aes(x = isocntry, y = value, fill = isocntry)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~ variable, scales = "free") +  
  labs(title = "Comparison of Means by Country",
       x = "Country",
       y = "Average Value") +
  scale_fill_manual(
    values = c(
      "ES" = "red4",
      "IT" = "darkgreen"
    ))+
  theme_minimal() +
  theme(legend.position = "none")

calculate_percentage <- function(data, var) {
  data %>%
    group_by(isocntry, !!sym(var)) %>%
    summarise(count = n(), .groups = "drop") %>%
    group_by(isocntry) %>%
    mutate(percentage = (count / sum(count)) * 100) %>%
    select(isocntry, !!sym(var), percentage)
}


summary_occupation <- calculate_percentage(data_comparaison, "occupation")
summary_religion <- calculate_percentage(data_comparaison, "religion")
summary_satisfaction <- calculate_percentage(data_comparaison, "personal_satis")
summary_contact_lgtbi <- calculate_percentage(data_comparaison, "contact_lgbti")

ggplot(summary_occupation, aes(x = isocntry, y = percentage, fill = occupation)) +
  geom_bar(stat = "identity", position = "fill", color = "black") +
  labs(title = "Distribution of occupation by country",
       x = "Country",
       y = "Percentage") +
  scale_fill_brewer(palette = "RdYlBu") +
  theme_minimal() +
  theme(
    strip.text = element_text(size = 14, face = "bold", family = "Roboto"),
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5, family = "Roboto", margin = margin(b = 20)),
    axis.text.x = element_text(size = 11, family = "Roboto", color = "black"),
    axis.text.y = element_text(size = 11, family = "Roboto", color = "black"),
    legend.title = element_text(size = 11, family = "Roboto", face = "bold"),
    legend.text = element_text(size = 10, family = "Roboto"),
    legend.box.background = element_rect(color = "black", size = 0.5),
    plot.margin = margin(15, 15, 15, 15),
    legend.key.size = unit(1, "lines")
  )
## Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family
## not found in Windows font database
## Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family
## not found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family
## not found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database
## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database
## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database
## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

ggplot(summary_religion, aes(x = isocntry, y = percentage, fill = religion)) +
  geom_bar(stat = "identity", position = "fill", color = "black") +
  labs(title = "Distribution of religion by country",
       x = "Country",
       y = "Percentage") +
  scale_fill_brewer(palette = "RdYlBu") +
  theme_minimal() +
  theme(
    strip.text = element_text(size = 14, face = "bold", family = "Roboto"),
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5, family = "Roboto", margin = margin(b = 20)),
    axis.text.x = element_text(size = 11, family = "Roboto", color = "black"),
    axis.text.y = element_text(size = 11, family = "Roboto", color = "black"),
    legend.title = element_text(size = 11, family = "Roboto", face = "bold"),
    legend.text = element_text(size = 10, family = "Roboto"),
    legend.box.background = element_rect(color = "black", size = 0.5),
    plot.margin = margin(15, 15, 15, 15),
    legend.key.size = unit(1, "lines"))
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database
## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database
## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database
## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

ggplot(summary_satisfaction, aes(x = isocntry, y = percentage, fill = personal_satis)) +
  geom_bar(stat = "identity", position = "fill", color = "black") +
  labs(title = "Distribution of personal satisfaction by country",
       x = "Country",
       y = "Percentage") +
  scale_fill_manual(
    values = c(
      "Not satisfied" = "red3",
      "Satisfied" = "olivedrab3"
    ))+
  theme_minimal() +
  theme(
    strip.text = element_text(size = 14, face = "bold", family = "Roboto"),
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5, family = "Roboto", margin = margin(b = 20)),
    axis.text.x = element_text(size = 11, family = "Roboto", color = "black"),
    axis.text.y = element_text(size = 11, family = "Roboto", color = "black"),
    legend.title = element_text(size = 11, family = "Roboto", face = "bold"),
    legend.text = element_text(size = 10, family = "Roboto"),
    legend.box.background = element_rect(color = "black", size = 0.5),
    plot.margin = margin(15, 15, 15, 15),
    legend.key.size = unit(1, "lines"))
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database
## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database
## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database
## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

ggplot(summary_contact_lgtbi, aes(x = isocntry, y = percentage, fill = contact_lgbti)) +
  geom_bar(stat = "identity", position = "fill", color = "black") +
  labs(title = "Distribution of Contact LGBTI by country",
       x = "Country",
       y = "Percentage") +
  scale_fill_manual(
    values = c(
      "No contact" = "red3",
      "Contact" = "olivedrab3"
    ))+
  theme_minimal() +
  theme(
    strip.text = element_text(size = 14, face = "bold", family = "Roboto"),
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5, family = "Roboto", margin = margin(b = 20)),
    axis.text.x = element_text(size = 11, family = "Roboto", color = "black"),
    axis.text.y = element_text(size = 11, family = "Roboto", color = "black"),
    legend.title = element_text(size = 11, family = "Roboto", face = "bold"),
    legend.text = element_text(size = 10, family = "Roboto"),
    legend.box.background = element_rect(color = "black", size = 0.5),
    plot.margin = margin(15, 15, 15, 15),
    legend.key.size = unit(1, "lines"))
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database
## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database
## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database
## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

Based on these results, we can conclude that, on average, Italians have less contact with individuals from the LGBTI community, tend to be more right-leaning, and report lower personal satisfaction. At the country level, Italy exhibits a higher level of religiosity and a lower Rainbow Indicator compared to Spain. In other aspects, both countries share similar characteristics. Given these factors, our explanatory model effectively accounts for the differences observed between Italy and Spain, demonstrating its validity in this context.